INTRODUCTION

Interest in quantitative research methods has grown
in economics in recent years. More and more
tables, graphs and formulae are to be found in economic
publications: textbooks, monographs, articles and studies.
Naturally the use of formal mathematical methods of research
must always be subordinate to qualitative analysis because of the 
complex character of socio-economic phenomena and pro- 
cesses. Economic research concentrates on the individual 
and social activities of man as an economic agent, on his 
behaviour as a producer or consumer, on his individual and 
social needs, on his customs, psychological reactions, tastes, 
likes and dislikes. Economics is one of the social sciences. 
It studies laws and relationships governing the social activities 
of men, particularly in the processes of production, distribu- 
tion, exchange and consumption. Although in economic studies 
phenomena are primarily analysed in their qualitative aspects, 
it does not follow that their quantitative aspects may be neg- 
lected. Thus, for instance, economists analyse and attempt to 
explain the relationships between wages and the productivity 
of labour, costs and production, demand and price as well 
as personal income, the productivity of labour and the mechan- 
ization and automation of the production processes, national 
income and investment expenditures, etc. In order to provide 
information that is as complete as possible, an analysis of 
these relationships, besides explaining the mechanism, must 
enable us to predict the behaviour of one of the related phenom- 
ena when the behaviour of the others is known. This is 
usually impossible when the phenomena studied are not
measurable and the relationships existing between them cannot
be presented in functional form. In economic studies it can be 
seen at almost every step how closely interrelated are the 
qualitative and quantitative aspects of phenomena. For this 
very reason a correct method of economic analysis cannot ignore 
the importance of the quantitative approach to the description 
and analysis of phenomena. 

Every day new and more specialized methods of research
using a more or less complex mathematical apparatus
are being adapted to economic studies. Usually they are statistical
methods 1, but because of the sphere of their application
they are generally referred to as econometric methods.
Great progress has been made recently in econometric
research and many interesting books dealing with econometrics
have been published. These books contain ample
evidence that the most valuable statistical methods applicable
to economic analysis are those belonging to the theory of
correlation and regression.

1 Seldom mathematical in the narrow sense of this word. An
example of a mathematical method may be found in analysis of
inter-branch flows (or input-output analysis), or in programming (linear,
non-linear, dynamic).

In economic applications of regression theory, linear regression
is of greatest importance. This is for many reasons,
of which the most important are:

1) linear regression is a simpler concept than curvilinear
regression and the calculations involved are much less
complicated;

2) linear regression appears most frequently in practice;
as is well known, the regression lines in a two-dimensional
normal distribution are straight lines; therefore,
in studying two-dimensional populations we deal with
linear regression at least as often as with a normal
distribution. An explanation of why a normal distribution
appears frequently is, in turn, found in the
Central Limit Theorem;

3) curvilinear regression can often be replaced by linear 
regression which provides an approximation close enough 
for practical purposes; 

4) curvilinear regression may be reduced to linear regression 
by replacing the curve by linear segments; 

5) linear regression is of particular importance for multi- 
dimensional variables. It is known that the nature of a 
function approximating regression I may be inferred 
from a scatter diagram. When the number of variables 
is greater than three a diagram cannot be drawn and 
common sense indicates that linear regression (being 
the simplest) should be used in such cases. 
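Reason 2 above rests on a standard property of the two-dimensional normal distribution, which can be written out explicitly. The symbols used here (means \mu_X, \mu_Y, standard deviations \sigma_X, \sigma_Y and correlation coefficient \rho) are introduced only for this illustration and are not notation taken from the text. If (X, Y) is normally distributed, the conditional expectation of each variable is a linear function of the other:

```latex
E(Y \mid X = x) = \mu_Y + \rho \, \frac{\sigma_Y}{\sigma_X} \, (x - \mu_X),
\qquad
E(X \mid Y = y) = \mu_X + \rho \, \frac{\sigma_X}{\sigma_Y} \, (y - \mu_Y).
```

Both regression lines pass through the point (\mu_X, \mu_Y), and they coincide only when \rho = 1 or \rho = -1.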

This book has been written primarily for scientists in 
economic, agricultural and technical colleges who deal with 
economic problems in their research. It is also addressed to 
graduates of economic and technical colleges employed in 
different branches of the national economy who because 
of their occupation have frequent occasion to use statis- 
tical methods in studying the relationships between phenom- 
ena. To this group belong primarily those engaged in plan- 
ning, statistics, cost accounting, economic analysis, time 
and motion studies, inventory control, and technology. This 
book may also be of some help to students in day and cor- 
respondence courses run by schools of economics and business. 
In order to use it with ease it is necessary to have a basic 
knowledge of calculus and of some elements of the theory 
of probability and mathematical statistics. Since econo- 
mists are usually interested in the humanities and their 
knowledge of mathematics is often rather scanty, the outline 
of this book is so designed as to facilitate its use by enabling
the reader to omit the more difficult parts without inter- 
fering with his understanding of the whole exposition. Those 
parts that require a better knowledge of mathematics and statis- 
tics are marked with one asterisk (*) at the beginning and with 
two (**) at the end. They may be omitted without lessening 
the understanding of the main ideas behind the methods 
presented, or the mastering of the computation technique. 
If the more difficult parts are omitted this book is accessible 
even to those whose background in mathematics and statistics 
is quite modest, so that the circle of its readers may be quite 
wide. Even though they will not be able to learn about all 
the more formal aspects of statistical research methods and 
of descriptions of relationships existing among phenomena 
which are presented in this book, they can learn the intuitive 
and computational aspects of these methods. This, of course, 
is most important from the point of view of the wide dissemi- 
nation of the methods presented. 

The book has been divided into 6 chapters. 

Chapter 1 constitutes the background for the whole work. 
It comprises the elementary concepts and the more important 
definitions and theorems concerning two-dimensional and 
multi-dimensional random variables. This chapter also con- 
tains an explanation of the symbols and terms used in the 
book. In Chapter 2 the more important applications of cor- 
relation methods to economics are reviewed. So far, correlation 
methods have rarely been used in economic analysis. The 
review of applications given in Chapter 2 and numerous exam- 
ples quoted in the following chapters will illustrate the use- 
fulness of statistical methods in analysing the relationships 
among random variables. 

In Chapter 3 methods of estimating regression parameters 
are discussed. Chapter 4 deals with methods of testing some 
statistical hypotheses important for practical applications 
of the correlation analysis. Particularly worth noting are
non-parametric tests for verifying the hypothesis that the
two-dimensional population is normal, and a non-parametric
test for verifying the hypothesis that the regression in the
population is a linear regression. In Chapter 5 methods of
transformation of curvilinear regression into linear regression are
discussed. In examples illustrating the computational tech- 
nique of determining regression parameters a new method 
called the two-point method has been used. In the last Chapter 
(6) an attempt has been made at a new approach to the prob- 
lem of trend. It is known that the determination of trend 
parameters, in a formal sense, does not differ from the deter- 
mination of regression parameters. There are, however, 
differences of substance between the trend line and the regression 
line, and, therefore, it is necessary to define the trend line 
in a way different from the definition of the regression line. 
This definition is given in Chapter 6, which is also a concluding 
chapter, in order to emphasize the fact that correlation meth- 
ods can be used not only in static but also in dynamic re- 
search. 

This work deals with two-dimensional variables. Most of 
the results obtained, however, may be generalized and applied 
to multi-dimensional variables. The author has tried to use 
diverse types of statistical data so as to create a broad basis 
for checking the usefulness of the two-point method he pro- 
poses for determining regression line parameters. For this 
reason, the work contains not only the results of the author's 
own research, but also statistical data from the works of other 
authors. Figures in square brackets [ ] denote the numbers
of the items in the Bibliography. Statistical data used in the 
book are quoted either in the text or in the Appendix at the 
end of the book. 

The work is divided into chapters, sections and items. The 
decimal system is used in denoting them. The first figure 
denotes the chapter, the second the section, and the third
the item. Thus, 2.2.3. denotes the third item of the second 
section of the second chapter. 

Formulae, tables and graphs are numbered separately 
for each numbered part of the book. 



1. REGRESSION AND CORRELATION 

1.1. General comments on regression and correlation

It is commonly known that one of the basic elements of the 
learning process is the scientific experiment. However, there 
are sciences in which it is very difficult to experiment, 
especially if the word experiment is understood to mean studying
the object in question under conditions artificially created
for this purpose. To this category belong, primarily,
the social sciences, and first among them
economics interpreted in the broad sense of the word (i.e.
not only political economy, but also all the related economic
disciplines).

In those sciences in which experimenting is difficult or 
impossible, the process of learning is particularly cumber- 
some. One of the objectives of an experiment is to establish 
a causal relation between the phenomenon studied and other 
phenomena. To achieve this purpose a large number of ex- 
periments have to be carried out and in the process the in- 
fluence of those factors which may be related to the phenom- 
enon studied is gradually eliminated and observations 
regarding the behaviour of the phenomenon are made in 
isolation. 

In this way experimenting may help in recognizing which 
factors exert an essential influence on the behaviour of the 
phenomenon studied, and which affect it slightly, non-essen- 
tially or not at all. If, for any reason, experimenting is impos- 
sible (e.g. when it creates a danger to human life, or is too 
costly, or technically impossible) then the process of learning
must take its course under natural conditions. In such cases
the search for a causal relation between the phenomenon 
studied and its environment is particularly difficult because 
then the question arises how to classify the phenomena into 
those which essentially affect the behaviour of the phenomenon 
studied and those with negligible influence. The answer to 
this question is provided by statistics. The importance of statis- 
tical methods of analysis is of the highest order for those sciences 
in which it is difficult to experiment, even if these sciences 
are as widely separated as demography and quantum physics. 

Suppose that we are interested in two phenomena, A and B. 
We have to find out whether or not one affects the other. 
If both phenomena can be expressed numerically then for 
the description of their mutual relationship we may use the 
mathematical apparatus provided by the theory of functions.
Such sciences as physics and mechanics very often make use
of mathematical functions to describe the relationship existing
between two phenomena. The quantities with which these
sciences deal may be considered as ordinary mathematical
variables. Thus, for instance, distance travelled is a function
of time and speed; voltage is a function of current intensity
and resistance; work is a function of force and distance.

Besides phenomena which have a relationship so close that
it may be regarded, for practical purposes, as functional, there
are others among which the relationship is weak and obscured
by the impact of many other forces of a secondary nature
which cannot be eliminated while making observations.
This type of relationship occurs when there exists an
interdependence between random variables. We say that two
random variables are stochastically dependent (as distinct from
functionally) if a change in one of them causes a change in
the distribution of the other (see [16], p. 364).
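The distinction between functional and stochastic dependence can be illustrated with a small simulation. This is only a sketch under invented assumptions: the values of X, the rule Y = 2X, the noise level and the function names are all hypothetical and do not come from the text.

```python
import random
import statistics


def simulate(n=3000, seed=0):
    """Draw X from {1, 2, 3} and build two hypothetical Y variables."""
    rng = random.Random(seed)
    xs = [rng.choice([1, 2, 3]) for _ in range(n)]
    # Functional dependence: Y is fully determined by X.
    y_functional = [2.0 * x for x in xs]
    # Stochastic dependence: X only shifts the distribution of Y.
    y_stochastic = [2.0 * x + rng.gauss(0.0, 1.5) for x in xs]
    return xs, y_functional, y_stochastic


def conditional_mean_and_spread(xs, ys, x_value):
    """Mean and standard deviation of Y among observations with X == x_value."""
    selected = [y for x, y in zip(xs, ys) if x == x_value]
    return statistics.mean(selected), statistics.pstdev(selected)


xs, y_functional, y_stochastic = simulate()
for x_value in (1, 2, 3):
    print(x_value,
          conditional_mean_and_spread(xs, y_functional, x_value),
          conditional_mean_and_spread(xs, y_stochastic, x_value))
```

In the functional case each value of X corresponds to a single value of Y, so the conditional spread is zero. In the stochastic case a change in X shifts the whole conditional distribution of Y, but individual values still scatter around its mean.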

Example 1. We are interested in the problem of the effect
of nutrition on the length of human life. The scatter diagram
is shown on Graph 1.

GRAPH 1. [Scatter diagram. x-axis: consumption (thousand calories);
y-axis: expectation of life. The numbered points correspond to the
following countries:]

1. Australia, 2. Austria, 3. Belgium, 4. Brazil, 5. Bulgaria,
6. Canada, 7. Chile, 8. Czechoslovakia, 9. Denmark, 10. Egypt,
11. Germany, 12. Greece, 13. Honduras, 14. Hungary, 15. Iceland,
16. India, 17. Ireland, 18. Italy, 19. Japan, 20. Mexico,
21. The Netherlands, 22. New Zealand, 23. Norway, 24. Panama,
25. Poland, 26. Portugal, 27. Siam, 28. Spain, 29. Sweden,
30. Switzerland, 31. United Kingdom, 32. USA

Note. The calories in the table above are "vegetable calories" in which
an allowance is made for a higher value in calories obtained from
proteins.

The x-axis represents average food
consumption per person, measured in calories, and the
y-axis expectation of life of man. On the graph we see a
collection of points. Each point may be regarded as a realization
(x, y) of a two-dimensional random variable (X, Y)
where X denotes the consumption of food and Y the expectation
of life of man. The points are numbered. On the right
side of the graph there is a list of countries, all numbered for
easy identification of the points corresponding to particular
countries. The distribution of the points on the graph shows
a clear tendency. It is expressed by a curve drawn among the
points. This curve is called a regression line. The ordinates of 
the curve give the expectation of life of man in different 
countries corresponding to different values of the average food 
consumption in those countries. It follows that a regression line 
is a functional expression of a stochastic relationship between 
random variables X and Y. If the regression line is a straight 
line we call it a linear regression. Its practical importance 
is of a very high order. When the relationship studied pertains 
not to two, but to a greater number of random variables, 
then instead of a regression line we get a regression plane 
(with three variables) or a regression hyperplane (with four 
or more variables). In such cases we deal with a multi-di- 
mensional regression. 

We have said above that the regression line expresses a 
certain tendency in a distribution of points on a scatter dia- 
gram. Particular points (x, y), as a rule, do not lie on a regression 
line, and are more or less removed from it but the majority 
of points are grouped around this line. The regression line 
expresses a relationship between the random variables Y and X. 
The deviations of particular points from the regression line 
may be considered a result of the influence of a variety of random 
factors. Let us imagine that a study of the interdependence 
between the length of human life and the consumption of 
food may be carried out under conditions ensuring the abso- 
lute elimination of the influence of any other factors besides 
food consumption on the length of life. (In practice, of course, 
this is impossible.) It could be surmised that under such 
circumstances the points on the graph would lie almost exactly 
along a certain curve which would mean that the relationship 
between variables Y and X is functional and not stochastic. 
It might be expected that the shape of such a curve would 
approximate the shape of the regression curve shown on 
Graph 1.

The importance of the regression line as a tool of learning
consists in the fact that it permits us to relate to any value 
of one variable the expected or most probable value of the other 
variable. This is of particular importance in cases when accu- 
rate observation of the values of one of the variables encounters 
substantial difficulties. Let us suppose, for instance, that we want 
to estimate the amount of timber that can be obtained from 
1,000 hectares of forest. If we know the regression line describ- 
ing the relationship between the amount of timber in a tree 
trunk and the circumference of the trunk measured at a certain 
height, we can easily solve the problem. 

The regression line is a tool of scientific prediction; if we
know the value of one variable the regression line allows
us to estimate the corresponding value of the other variable.
The less the particular points deviate from the regression
line, the better and the more accurate will be our estimate,
because the stochastic relationship then approaches a
functional relationship. The influence of random factors
diminishes and that of a regular factor is revealed more clearly.
The bond between the two variables studied becomes stronger.
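The use of a regression line for prediction can be sketched in a few lines of code. The sketch below fits a straight line by ordinary least squares, which is a standard technique swapped in here for illustration and not the two-point method proposed in this book. The sample data (trunk circumference and timber volume) and the function names are invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates (intercept, slope) of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Value of the fitted regression line at x."""
    return intercept + slope * x


# Hypothetical sample: trunk circumference (cm) and timber volume (cubic metres).
circumference = [60.0, 75.0, 90.0, 105.0, 120.0, 135.0]
timber_volume = [0.21, 0.33, 0.52, 0.64, 0.85, 0.97]

a, b = fit_line(circumference, timber_volume)
print("estimated volume for a 100 cm trunk:", round(predict(a, b, 100.0), 3))
```

Once the line has been estimated from a sample in which both variables were measured, only the easily observed variable (here the circumference) needs to be recorded for the remaining trees.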

The whole group of problems related to measuring the
strength of the relationship between random variables is the
subject of the branch of statistics called correlation theory 1.
The measures of correlation most frequently used are: the
correlation coefficient and the correlation ratio.

1 The word "correlation" in statistics means interdependence between
random variables.

Statistical methods of studying interdependence between
random variables allow us not only to measure the strength
of this interdependence but also to verify the hypothesis that
two variables are correlated with one another. The objective
of every science is to discover and to explain causal relations
existing between phenomena. Sometimes such relations are
strong and immediately apparent. Often, however, they are
weak and hidden among many diverse relationships existing
between the phenomenon studied and the outside world.
The researcher, on the basis of his scientific analysis, puts forward
the hypothesis that there exists a causal relationship between
two defined phenomena. It may happen that it is impossible
to test this hypothesis by a direct experiment. Correlation
theory has at its disposal methods which allow us, in many
cases, to verify such hypotheses.

Example 2. The hypothesis has been postulated that an
increase in the consumption of animal protein reduces fertility.
This hypothesis seems to be fairly unexpected. It cannot be
tested by experimenting. Its verification, however, can be
carried out on the basis of statistical data contained in Table 1
(see [5], p. 82).

TABLE 1

THE RELATIONSHIP BETWEEN THE BIRTH RATE
AND CONSUMPTION OF PROTEIN

No.   Country       Birth rate      Daily consumption of animal
                    per thousand    protein in grammes per person

 1    Formosa          45.6                 4.7
 2    Malaya           39.7                 7.5
 3    India            33.0                 8.7
 4    Japan            27.0                 9.7
 5    Yugoslavia       25.9                11.2
 6    Greece           23.5                15.2
 7    Italy            23.4                15.2
 8    Bulgaria         22.2                16.8
 9    Germany          20.0                37.3
10    Ireland          19.1                46.7
11    Denmark          18.3                59.1
12    Australia        18.0                59.9
13    USA              17.9                61.4
14    Sweden           15.0                61.6

Even a casual glance at the data contained in
this table indicates that the birth rate decreases as the
consumption of animal protein increases. We are dealing here
with a case of negative correlation. This term is used to define
the type of correlation in which an increase in the value of
one random variable is accompanied by a decrease in the
value of the other. The relationship between the birth rate
and the consumption of animal protein becomes even more
apparent on a scatter diagram (Graph 2). The trend in the
distribution of the points on the graph is very distinctly
marked. Of course, neither the table nor the graph provides
a basis for accepting the hypothesis.

GRAPH 2. [Scatter diagram of the birth rate and the consumption
of animal protein for the countries in Table 1.]
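The negative correlation visible in Table 1 can be summarized by the correlation coefficient. The sketch below uses the usual product-moment formula, stated here as an assumption because the measure is not yet defined at this point in the text. The figures are those of Table 1, read with a decimal point, and the variable and function names are illustrative.

```python
import math

# Data from Table 1 (Formosa, Malaya, ..., Sweden).
birth_rate = [45.6, 39.7, 33.0, 27.0, 25.9, 23.5, 23.4,
              22.2, 20.0, 19.1, 18.3, 18.0, 17.9, 15.0]
animal_protein = [4.7, 7.5, 8.7, 9.7, 11.2, 15.2, 15.2,
                  16.8, 37.3, 46.7, 59.1, 59.9, 61.4, 61.6]


def correlation_coefficient(xs, ys):
    """Product-moment correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation_coefficient(animal_protein, birth_rate)
print("correlation coefficient:", round(r, 2))
```

The coefficient is negative, in line with the table. It measures only the linear part of the relationship and, as the text stresses, says nothing by itself about whether the relation is causal.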

Methods for testing hypotheses of this kind will be discussed
later. We shall here, however, take the opportunity to say
a few words about apparent or spurious correlation. This
is the type of correlation in which a relationship appears between
statistical series, but there is no causal relation between the
phenomena described by these series. For instance, let phenomenon
A be causally related to phenomenon B and phenomenon C.
There will be a correlation between the statistical series describing
A and the series describing B. The same situation will exist
with regard to C which will show correlation with A. It is clear
that owing to the correlation between the series describing
A and B and the correlation between the series describing
A and C, there may also appear a correlation between the
series describing B and C. However, this type of relationship
between statistical series is of a formal mathematical nature,
since there is no direct causal relation between B and C.
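The mechanism just described can be reproduced in a small simulation. This is a sketch under invented assumptions: A is taken to be a common cause of both B and C, all three series are artificial, and the noise levels and function names are hypothetical.

```python
import math
import random


def correlation(xs, ys):
    """Product-moment correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_common_cause(n=5000, seed=1):
    """B and C each depend on A plus independent noise, never on each other."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [value + rng.gauss(0.0, 0.5) for value in a]
    c = [value + rng.gauss(0.0, 0.5) for value in a]
    return a, b, c


a, b, c = simulate_common_cause()
print("correlation of B and C:", round(correlation(b, c), 2))

# Subtracting the known contribution of A leaves only the independent noise.
b_rest = [bv - av for av, bv in zip(a, b)]
c_rest = [cv - av for av, cv in zip(a, c)]
print("after removing A:", round(correlation(b_rest, c_rest), 2))
```

B and C appear strongly correlated although neither influences the other. Once the contribution of A is removed, the correlation practically vanishes.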

Statistical experience supplies many illustrations of spurious 
relationships. For example Tschuprow [59] states that the 
statistics on compulsory fire insurance in prewar Russia 
showed an unusually close relationship between the average 
number of buildings destroyed in one fire and the application 
of fire engines to extinguishing fires. The evidence shows there- 
fore that the losses caused by a fire were greater in cases 
where fire engines were used, and smaller in cases where 
they were not used. This might be taken to indicate that in 
order to reduce losses caused by fires the use of fire engines 
should be abandoned. The explanation was of course that fire 
brigades were usually called only in the more serious cases. 
When a single building was on fire and there was no danger 
that it might spread to other buildings, fire brigades usually 
did not interfere. In this case, then, there is an interdepend- 
ence between the intensity of a fire and the participation 
of a fire brigade which uses fire engines. There is no 
causal relationship, however, between the application of fire 
engines and the number of buildings destroyed. This is an 
example of a spurious relationship. 

In prewar Russian statistics we find further interesting 
examples of spurious relationships. For instance, it has been 
established on the basis of abundant statistical material that 
when a doctor assisted in child birth, the percentage of still- 
born children was higher than in cases delivered by a mid- 
wife. At first glance it might appear that in order to reduce 
infant mortality, doctors should not be called, a conclusion 
which is obviously absurd. 

It turns out that the relationship observed is a spurious 
relationship. It should be remembered that formerly the doctor 
was called at childbirth only in serious cases which often
ended in the death of the infant. Hence the numerical rela- 
tionship between infant mortality and the assistance of qual- 
ified medical personnel. 

The relationship between a drop in crop yields and the 
number of fires, observed by statisticians, also belongs to 
the category of spurious correlations. The number of fires 
increased in the years when precipitation was low. In the 
same years yields were lower than average. 

An amusing case of spurious correlation between the 
number of registered births and the number of storks has 
been noted in Scandinavian countries on the basis of abun- 
dant statistical data. 

The above examples of spurious correlation prove that 
categorical judgments about the existence of causal relations 
should not be formed on the basis of numerical relationships. 
A causal relation may exist but does not necessarily exist. When 
there is a causal relation between observed phenomena it may 
be expected that there will also be a numerical relationship. 
This relationship may sometimes appear very distinctly and 
sometimes less distinctly, or it may be so weak that it will 
hardly be noticeable. Such a relationship exists, however, 
when there is a causal relation between the phenomena studied. 

Inferring that there is a causal relation on the basis of a 
numerical interdependence may lead to an absurd conclusion, 
as we have seen from the above examples. Their very absurdity 
protects us from accepting them. However, if the conclusions 
resulting from a hypothesis based on mathematical premises 
that a causal relation exists between phenomena are not 
absurd then the temptation to accept them may be very strong. 
One must not yield to such temptations. If there is a relationship 
between two statistical series the following cases are possible: 

1) there is a causal relation between the phenomena described 
by these series; 

2) there is no direct causal relation between the phenomena
described by these series because they are correlated with 
another, unknown series describing a phenomenon caus- 
ally related to the phenomena studied by us; 

3) the observed relationship is accidental. 

Choosing the first possibility would be tantamount to giving 
it priority over the remaining two without any foundation. 
The mere existence of a correlation between statistical series 
is only a signal that there may exist some direct or indirect 
relation between the phenomena described by these series. 

Summing up, we may formulate the following rules: 

1) if a causal relation has been discovered between two 
phenomena then correlation analysis may be used to 
determine the strength of this relation; 

2) if a causal relation has not been discovered, but it may 
be assumed that such a relation between the studied 
phenomena does exist, then the appearance of a distinct 
correlation on the basis of more abundant statistical 
material substantially strengthens the hypothesis that 
a causal relation exists; 

3) finally, if before making observations there were no grounds 
for postulating a relationship between two phenomena, 
but after observations have been completed and statistical 
material compiled a distinct correlation between statis- 
tical series can be noticed, then there is reason to assume 
that a causal relation may exist between the phenomena 
studied. It follows that even a formal analysis of a nu- 
merical relationship is fully justified since it may lead 
to a scientific discovery. 

The usefulness of the rules formulated above becomes 
particularly apparent in Example 2, where we deal with a very 
strong interdependence between two statistical series: between 
the series on the birth rate in different countries and the 
series on the daily consumption of animal protein in those 
countries. In spite of a strong correlation between these series
it should not be inferred on this basis that there exists a causal 
relation between fertility and the consumption of animal 
protein. Various sciences are engaged in discovering and ex- 
plaining causal relations. Statistics facilitates these tasks for 
them by supplying useful research tools. 

In economic research, regression analysis and correlation 
analysis find very many applications. In company economics 
or in "micro-economics", the whole cost and effectiveness
theory has been worked out for industrial enterprises 1
on the basis of regression analysis. Amongst studies in this 
field the following should be named: [9], [12], [13], [18], [19], 
[23], [37], [49], [57]. Correlation analysis can also be applied 
in the economics of the firm to the analysis of the velocity of 
circulation of liquid assets, to studies on the productivity of 
labour and the degree of utilization of working time, to the 
analysis of wages and the wage fund, etc. A separate field for 
the application of correlation methods is that of the tech- 
nology of production. Correlation analysis and particularly 
regression analysis can be of great service in studying the 
influence of technological processes on the quality and cost 
of the product and on the length of the production period. 

In macro-economic studies correlation is used primarily 
for determining Engel curves and supply and demand curves. 
There are many works in this field. We shall mention here 
only some of the more important: [3], [22], [36], [39], [45], 
[46], [50], [51], [60]. 



1 Since the author has been engaged so far in studying the applications 
of linear regression primarily to the analysis of the effectiveness of the 
industrial enterprise, most of the examples quoted in the book are from 
this field of research. The book also contains examples of other applica- 
tions, since the author is interested in demonstrating by many and 
diversified examples the usefulness of the two-point method proposed 
by him for the determination of regression parameters. 




An important and now widely studied statistical problem 
is the application of correlation methods to the analysis of 
time series (the determination of the trend, the analysis of 
seasonal factors, the auto-correlation of time series, correlo- 
grams). Among the more important works the following 
should be mentioned: [9], [33], [62]. 

1.2. Two-dimensional random variables 1 

1.2.1. Definitions and symbols 

Let D be a given set of events forming a complete group 
(see [25], p. 22). If a pair of numbers has been assigned to 
each of the events, these numbers may be treated as the values 
of two functions determined on the set D. 

Definition 1. A pair of functions of real variables deter- 
mined on the set D is called a two-dimensional random variable. 
Two-dimensional variables are usually denoted by the symbol 

$\xi = (X_1, X_2)$.    (1)

Definition 1 can easily be generalized to include multi-dimensional
variables. In addition to notation (1) a two-dimensional
variable may also be denoted as follows:

$\xi = (X, Y)$.    (2)

In this work we shall use only notation (2). Multi-dimensional
variables are generally denoted by

$\xi = (X_1, X_2, \ldots, X_n)$.    (3)

Random variables are sometimes interpreted geometrically.
To each event from set D there corresponds a certain arbitrary
point on the plane, so that the set of events D has a corresponding
set of points D' on the plane. The location of points on
the two-dimensional plane $R^2$ is determined by two components.
These components of the points of set D' are the two-dimensional
random variable (X, Y).

1 Items 1.2.1., 1.2.2. and 1.2.3. have been published in paper [29].

An equivalent of a random variable in statistics is a sta- 
tistical characteristic. A population with two characteristics 
is called a two-dimensional population. Two characteristics 
of a population are equivalent to a two-dimensional random 
variable and particular statistical observations expressing 
the values of each of these characteristics for particular statis- 
tical units belonging to the population analysed are equiv- 
alents of the realization of the two-dimensional random 
variable. An example of a two-dimensional population is the 
labour force of a factory studied from the point of view of 
seniority in employment and earnings. 

Similarly, as in the case of one-dimensional random varia- 
bles, two-dimensional random variables may be treated as 
discrete random variables and continuous random variables. 

1.2.2. Two-dimensional discrete random variables 

Definition 1. The two-dimensional random variable (X,Y) 
is a discrete variable if the sets of values of variable X and 
variable Y are finite or denumerable. 

Definition 2. The distribution function of the two-dimen- 
sional random variable is a function which assigns appro- 
priate probabilities to the values of this variable. The distri- 
bution function of the two-dimensional discrete random 
variable is expressed in the following way: 

$$P(X = x_i,\ Y = y_j) = p_{ij}. \qquad (1)$$

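If a set of values of the variable is finite then these values
and the probabilities corresponding to particular values
of the variable can be set out in the following contingency
table:

TABLE 1
CONTINGENCY TABLE

          |  y_1    y_2    ...   y_j    ...   y_n   |  Σ_j
   -------+-----------------------------------------+-------
    x_1   |  p_11   p_12   ...   p_1j   ...   p_1n  |  p_1.
    x_2   |  p_21   p_22   ...   p_2j   ...   p_2n  |  p_2.
    ...   |  ...    ...    ...   ...    ...   ...   |  ...
    x_i   |  p_i1   p_i2   ...   p_ij   ...   p_in  |  p_i.
    ...   |  ...    ...    ...   ...    ...   ...   |  ...
    x_m   |  p_m1   p_m2   ...   p_mj   ...   p_mn  |  p_m.
   -------+-----------------------------------------+-------
    Σ_i   |  p_.1   p_.2   ...   p_.j   ...   p_.n  |  1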

Since a set of events which determines the two-dimensional
random variable forms a complete group of events, then

$$\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} = 1. \qquad (2)$$



Sum (2) is obtained by adding together all the probabili- 
ties contained in Table 1. This can be done in two ways: by 
summing up the rows and then the sums of the rows in the 
last column, or the other way around, by summing up the 
columns and then the sums of the columns in the last row. 
It follows that the sum of the last column equals the sum
of the last row and equals one, i.e.

$$\sum_{i=1}^{m} p_{i\cdot} = 1, \qquad (3)$$

$$\sum_{j=1}^{n} p_{\cdot j} = 1, \qquad (4)$$

where

$$p_{i\cdot} = \sum_{j=1}^{n} p_{ij}, \qquad (5)$$

$$p_{\cdot j} = \sum_{i=1}^{m} p_{ij}. \qquad (6)$$

It follows from equations (3) and (4) that the probabilities
shown in the last column and in the last row of Table 1 form
distributions. They are called marginal distributions of the
discrete random variable (X, Y).

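Let us write the sum on the right side of formula (5) in a
developed form:

$$p_{i\cdot} = p_{i1} + p_{i2} + \ldots + p_{in}. \qquad (7)$$

After dividing both sides of equation (7) by $p_{i\cdot}$ we get

$$\frac{p_{i1}}{p_{i\cdot}} + \frac{p_{i2}}{p_{i\cdot}} + \ldots + \frac{p_{in}}{p_{i\cdot}} = 1. \qquad (8)$$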

Since sum (8) equals unity then we have a probability
distribution. It is the conditional distribution of Y on X.
Let us denote

$$p(y_j \mid x_i) = \frac{p_{ij}}{p_{i\cdot}}, \qquad p(x_i \mid y_j) = \frac{p_{ij}}{p_{\cdot j}}, \qquad (9)$$

where $p(y_j \mid x_i)$ is the conditional probability that $Y = y_j$,
based on the assumption that $X = x_i$, and $p(x_i \mid y_j)$ is the
conditional probability that $X = x_i$, assuming that $Y = y_j$.
Therefore, formula (8) may be written in the following short
form:

$$\sum_{j=1}^{n} p(y_j \mid x_i) = 1. \qquad (10)$$

Similarly the conditional distribution of X on Y may be
presented in the following form:

$$\sum_{i=1}^{m} p(x_i \mid y_j) = 1. \qquad (11)$$

On the basis of (9)

$$p_{ij} = p_{i\cdot} \cdot p(y_j \mid x_i) = p_{\cdot j} \cdot p(x_i \mid y_j). \qquad (12)$$

It follows from formula (12) that the two-dimensional
joint probability equals the product of the marginal probability
of one variable and the conditional probability of the other.
The term "joint probability" is denoted by $p_{ij}$. This
emphasizes the fact that $p_{ij}$ is the probability of a
two-dimensional variable, whereas $p_{i\cdot}$, $p_{\cdot j}$,
$p(x_i \mid y_j)$, $p(y_j \mid x_i)$ are the probabilities
of one-dimensional variables.
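
As a small numerical illustration of formulae (5), (6), (9) and (12), the following Python sketch starts from a hypothetical 2 x 3 contingency table and computes its marginal and conditional distributions with exact fractions. The probabilities are invented for the example and the variable names are our own; nothing here comes from the statistical material of the book.

```python
from fractions import Fraction as F

# Hypothetical joint table p[i][j] = P(X = x_i, Y = y_j); illustrative numbers only.
p = [
    [F(1, 10), F(2, 10), F(1, 10)],
    [F(3, 10), F(1, 10), F(2, 10)],
]
m, n = len(p), len(p[0])

# Marginal distributions: row sums p_i. (formula (5)) and column sums p_.j (formula (6)).
p_row = [sum(p[i]) for i in range(m)]
p_col = [sum(p[i][j] for i in range(m)) for j in range(n)]

# Conditional distributions, formula (9).
p_y_given_x = [[p[i][j] / p_row[i] for j in range(n)] for i in range(m)]
p_x_given_y = [[p[i][j] / p_col[j] for j in range(n)] for i in range(m)]

# Formula (12): joint probability = marginal probability * conditional probability.
formula_12_holds = all(
    p[i][j] == p_row[i] * p_y_given_x[i][j] == p_col[j] * p_x_given_y[i][j]
    for i in range(m)
    for j in range(n)
)

print(p_row, p_col, formula_12_holds)
```

Each row of `p_y_given_x` sums to one, which is formula (10) for the corresponding value of X.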

Definition 3. Two discrete random variables X and Y are
independent if for all i, j

$$p(y_j \mid x_i) = p_{\cdot j}, \qquad (13)$$

or, what amounts to the same thing:

$$p(x_i \mid y_j) = p_{i\cdot}. \qquad (14)$$

In this case formula (12) assumes a simpler form:

$$p_{ij} = p_{i\cdot} \cdot p_{\cdot j}. \qquad (15)$$

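Hence it follows on the basis of (12), (13) and (14) that
for $r \le m$, $s \le n$ the following equality holds:

$$\sum_{i=1}^{r}\sum_{j=1}^{s} p_{ij} = \sum_{i=1}^{r}\sum_{j=1}^{s} p_{i\cdot}\, p_{\cdot j} = \sum_{i=1}^{r} p_{i\cdot} \sum_{j=1}^{s} p_{\cdot j}, \qquad (16)$$

and finally

$$P(X < x,\ Y < y) = P(X < x) \cdot P(Y < y). \qquad (17)$$

Definition 4. Function F(x,y) = P(X < x, Y < y) is called
a two-dimensional distribution function of random variable
(X,Y).¹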
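When the two-dimensional random variable is discrete
then it follows from the definition of the distribution function
that

$$F(x,y) = \sum_{x_i < x}\ \sum_{y_j < y} p_{ij} \qquad (18)$$

and

$$F(+\infty, +\infty) = \sum_{x_i < \infty}\ \sum_{y_j < \infty} p_{ij} = 1. \qquad (19)$$

The marginal distribution of variable X is expressed by
the formula

$$F(x, +\infty) = \sum_{x_i < x}\ \sum_{y_j < \infty} p_{ij}, \qquad (20)$$

or

$$F(x, +\infty) = \sum_{x_i < x} p_{i\cdot}.$$

The formulae for the marginal distribution of variable Y
are analogous:

$$F(+\infty, y) = \sum_{x_i < \infty}\ \sum_{y_j < y} p_{ij} = \sum_{y_j < y} p_{\cdot j}. \qquad (21)$$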

It follows from formulae (17), (20) and (21) that if two
discrete random variables X and Y are independent the
two-dimensional distribution function of (X,Y) is equal to the
product of the distribution functions of the one-dimensional
variables X and Y. The reverse statement is also true.
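
The equivalence just stated can be checked numerically. The sketch below, written with hypothetical tables of our own choosing, tests both formula (15) and the factorization of the distribution function at every step of a finite table; the helper names are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction as F

def marginals(p):
    """Row sums p_i. and column sums p_.j of a joint table p[i][j]."""
    rows = [sum(row) for row in p]
    cols = [sum(row[j] for row in p) for j in range(len(p[0]))]
    return rows, cols

def probabilities_factorize(p):
    """Formula (15): p_ij = p_i. * p_.j for all i, j."""
    rows, cols = marginals(p)
    return all(p[i][j] == rows[i] * cols[j]
               for i in range(len(p)) for j in range(len(p[0])))

def cdf(p, r, s):
    """Sum of p_ij over the first r values of X and the first s values of Y."""
    return sum(p[i][j] for i in range(r) for j in range(s))

def cdf_factorizes(p):
    """F(x, y) = F(x, +oo) * F(+oo, y) at every step of the distribution function."""
    m, n = len(p), len(p[0])
    return all(cdf(p, r, s) == cdf(p, r, n) * cdf(p, m, s)
               for r in range(m + 1) for s in range(n + 1))

# Hypothetical tables: one built as a product of marginals,
# one concentrated on the diagonal (strong dependence).
independent = [[F(1, 4) * q for q in (F(1, 3), F(2, 3))],
               [F(3, 4) * q for q in (F(1, 3), F(2, 3))]]
dependent = [[F(1, 2), F(0)],
             [F(0), F(1, 2)]]

print(probabilities_factorize(independent), cdf_factorizes(independent))
print(probabilities_factorize(dependent), cdf_factorizes(dependent))
```

For these two tables the two criteria agree with each other, as the statement above requires.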



1 The definition of a one-dimensional distribution function is
analogous. The distribution function of X is the function
F(x) = P(X < x).

1.2.3. Two-dimensional continuous random variables

Definition 1. The density function f(x,y) of the
two-dimensional random variable (X,Y) is a mixed derivative of
second order of the distribution function F(x,y) with respect
to x and y of this variable at point (x,y), i.e.

$$f(x,y) = \frac{\partial^2 F(x,y)}{\partial x\, \partial y}. \qquad (1)$$

Definition 2. The two-dimensional random variable (X,Y) 
is continuous if its distribution function F(x,y) is continuous
and if the density function f(x,y) is also a continuous function 
with the possible exception of a collection of points belonging 
to a finite number of curves. 

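On the basis of Definition 1 we have:

$$F(x,y) = \int_{-\infty}^{x}\!\int_{-\infty}^{y} f(u,v)\, dv\, du \qquad (2)$$

and

$$F(\infty, \infty) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(u,v)\, dv\, du = 1, \qquad (3)$$

$$F(-\infty, -\infty) = F(-\infty, y) = F(x, -\infty) = 0. \qquad (4)$$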

The marginal distribution of variable X is expressed by
the formula

$$F(x, \infty) = \int_{-\infty}^{x}\!\int_{-\infty}^{\infty} f(u,v)\, dv\, du = \int_{-\infty}^{x} f_1(u)\, du. \qquad (5)$$

In formula (5)

$$f_1(u) = \int_{-\infty}^{\infty} f(u,v)\, dv \qquad (6)$$

denotes the marginal density of variable X.

The formulae for the distribution function and marginal
density of variable Y are analogous owing to the symmetry
of the formulae. We shall denote the marginal distribution
functions of variables X and Y by $F_1(x)$ and $F_2(y)$.

In discussing the two-dimensional discrete random variable 
we have given the definition of conditional probability (p. 15, 
formula (9)). 

When the random variable is continuous, we shall understand
the conditional probability

$$P(y \le Y < y + \Delta y \mid x \le X < x + \Delta x)$$

as the expression

$$\frac{P(x \le X < x + \Delta x,\ y \le Y < y + \Delta y)}{P(x \le X < x + \Delta x)}, \qquad (7)$$

assuming at the same time that $P(x \le X < x + \Delta x) > 0$.
Comparing formula (7) with formula (9) from the preceding
section we can easily notice a formal similarity between them.
Indeed, in formula (7), instead of quantities $x_i$ and $y_j$ we
have put in expressions $(x \le X < x + \Delta x)$ and
$(y \le Y < y + \Delta y)$, respectively. For continuous random
variables, probabilities corresponding to particular values of
the variables always equal zero, since we have

$$P(X = x) = P(Y = y) = P(X = x,\ Y = y) = 0.$$

GRAPH 1.

Conditional probability (7) is the probability that point
(X,Y), chosen at random, will be located in rectangle
$y \le Y < y + \Delta y$, $x \le X < x + \Delta x$ when it is known
that this point lies within area $x \le X < x + \Delta x$,
$-\infty < Y < \infty$ (Graph 1).
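It follows from the definition of the distribution function
of the continuous variable that

$$P(y \le Y < y + \Delta y \mid x \le X < x + \Delta x) = \frac{\int_{x}^{x+\Delta x}\!\int_{y}^{y+\Delta y} f(u,v)\, dv\, du}{\int_{x}^{x+\Delta x}\!\int_{-\infty}^{\infty} f(u,v)\, dv\, du}. \qquad (8)$$

Of course

$$P(-\infty < Y < \infty \mid x \le X < x + \Delta x) = 1. \qquad (9)$$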


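The conditional distribution function is expressed by the
formula

$$F(y \mid x \le X < x + \Delta x) = \frac{\int_{x}^{x+\Delta x}\!\int_{-\infty}^{y} f(u,v)\, dv\, du}{\int_{x}^{x+\Delta x}\!\int_{-\infty}^{\infty} f(u,v)\, dv\, du}, \qquad (10)$$

and the density f(y|x) of the conditional distribution by
formula

$$f(y \mid x) = \frac{f(x,y)}{\int_{-\infty}^{\infty} f(x,y)\, dy}. \qquad (11)$$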

In formulating the definition of conditional probability (see
formula (7)) we considered variable Y on the assumption that
variable X satisfies the inequality $x \le X < x + \Delta x$.

It is easy to find formulae for variable X which are
symmetrical to formulae (7) to (11) on the assumption that
variable Y satisfies the inequality $y \le Y < y + \Delta y$.

On page 16 we have given the definition of independent
discrete random variables. We shall now give the definition of
independent continuous random variables.

Definition 3. Two continuous random variables X and Y
are independent if

$$P(x_1 \le X < x_2,\ y_1 \le Y < y_2) = P(x_1 \le X < x_2) \cdot P(y_1 \le Y < y_2), \qquad (12)$$

where $x_1$, $x_2$ and $y_1$, $y_2$ are any real numbers. It is easy to
prove that for two random variables to be independent it is
necessary and sufficient that their joint two-dimensional
distribution function equal the product of the marginal
distribution functions of variable X and variable Y:

$$F(x,y) = F_1(x) \cdot F_2(y). \qquad (13)$$

The same theorem was given above for discrete variables. 

1.2.4. Moments of a two-dimensional variable 

Definition 1. The relative moment of a two-dimensional
random variable (X,Y) is the expected value of

$$(X - C)^l\, (Y - D)^k,$$

where l + k is the order of the moment and l, k are
non-negative integers. C and D, which can be considered as the
coordinates of an arbitrary point, are any real numbers.

Definition 2. If C = D = 0 the expected value of the
product $X^l Y^k$ is called the ordinary moment (or simply the
moment) of the two-dimensional random variable (X,Y). These
moments are usually denoted by the symbol $m_{lk}$.

In accordance with this definition

$$m_{lk} = E(X^l Y^k) = \sum_{i} \sum_{j} x_i^l\, y_j^k\, p_{ij}, \qquad (1)$$

where variable (X,Y) is discrete, and

$$m_{lk} = E(X^l Y^k) = \lim_{\substack{a \to -\infty \\ b \to +\infty}}\ \lim_{\substack{c \to -\infty \\ d \to +\infty}} \int_{a}^{b}\!\int_{c}^{d} x^l y^k f(x,y)\, dy\, dx = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} x^l y^k f(x,y)\, dy\, dx, \qquad (2)$$

where variable (X,Y) is continuous.

The most frequent use is made of moments of the first and
second order. Moments of the first order are the expected
values of the random variables X and Y:

$$m_{10} = E(X^1 Y^0) = E(X) \qquad (3)$$

and

$$m_{01} = E(X^0 Y^1) = E(Y). \qquad (4)$$

The moment of the second order defined by formula

$$m_{11} = E(XY) \qquad (5)$$

is called a product moment.

The remaining two moments of the second order are expressed
by the formulae

$$m_{20} = E(X^2 Y^0) = E(X^2) \qquad (6)$$

and

$$m_{02} = E(X^0 Y^2) = E(Y^2). \qquad (7)$$

Definition 3. Moments with reference points C = E(X)
and D = E(Y) are called central moments. They are usually
denoted by $\mu_{lk}$.
We then have

$$\mu_{lk} = E[(X - m_{10})^l\, (Y - m_{01})^k]. \qquad (8)$$

Of course

$$\mu_{10} = E[(X - m_{10})^1\, (Y - m_{01})^0] = 0 \qquad (9)$$

and

$$\mu_{01} = E[(X - m_{10})^0\, (Y - m_{01})^1] = 0. \qquad (10)$$

In our further considerations three central moments of the
second order will be of great importance:

$$\mu_{20} = E[(X - m_{10})^2] = V(X) \qquad (11)$$

and

$$\mu_{02} = E[(Y - m_{01})^2] = V(Y). \qquad (12)$$

These are variances of random variable X and random variable
Y. The mixed central moment of the second order

$$\mu_{11} = E[(X - m_{10})(Y - m_{01})] \qquad (13)$$

is known as a covariance. It is often denoted by C(X,Y).

Central moments of a two-dimensional random variable
can be expressed by ordinary moments, and vice versa. It
is easy to show, for instance, that

$$\mu_{11} = m_{11} - m_{10}\, m_{01}. \qquad (14)$$

Indeed

$$\mu_{11} = E[(X - m_{10})(Y - m_{01})] = E(XY) - m_{10} E(Y) - m_{01} E(X) + m_{10}\, m_{01} = m_{11} - m_{10}\, m_{01}.$$

Similarly, it can be proved that

$$\mu_{20} = m_{20} - m_{10}^2 \qquad (15)$$

and

$$\mu_{02} = m_{02} - m_{01}^2. \qquad (16)$$
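
Relations (14), (15) and (16) are easy to verify numerically. The following sketch assumes a small discrete distribution of our own invention (values `xs`, `ys` and table `p` are hypothetical) and computes ordinary and central moments directly from their definitions.

```python
from fractions import Fraction as F

# Hypothetical discrete distribution: values of X, values of Y and joint table p[i][j].
xs = [0, 1, 2]
ys = [1, 3]
p = [[F(1, 10), F(1, 10)],
     [F(2, 10), F(2, 10)],
     [F(1, 10), F(3, 10)]]

def relative_moment(l, k, c=0, d=0):
    """E[(X - c)^l (Y - d)^k] for the discrete table above."""
    return sum((x - c) ** l * (y - d) ** k * p[i][j]
               for i, x in enumerate(xs) for j, y in enumerate(ys))

def m(l, k):
    """Ordinary moment m_lk (reference point C = D = 0)."""
    return relative_moment(l, k)

m10, m01 = m(1, 0), m(0, 1)

def mu(l, k):
    """Central moment mu_lk (reference point C = m10, D = m01)."""
    return relative_moment(l, k, m10, m01)

print(m10, m01, mu(1, 1), mu(2, 0), mu(0, 2))
```

Because exact fractions are used, the identities hold as equalities rather than only up to rounding.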

The following important theorem can be proved for covar- 
iance. 

THEOREM 1. If random variables X and Y are independent 
then the covariance C(X,Y) of these variables equals zero. 

Proof¹. When variables X and Y are independent, then

$$p_{ij} = p_{i\cdot} \cdot p_{\cdot j}$$

(see 1.2.2., formula (15)). Therefore

$$C(X,Y) = \sum_{i} \sum_{j} (x_i - m_{10})(y_j - m_{01})\, p_{i\cdot}\, p_{\cdot j} = \sum_{i} (x_i - m_{10})\, p_{i\cdot} \sum_{j} (y_j - m_{01})\, p_{\cdot j}.$$

On the basis of (9) and (10) we obtain

$$C(X,Y) = 0.$$

1 The proof is for discrete variables. The situation is similar when
the variables are continuous.

The converse theorem is not true.
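
A standard counterexample shows why the converse fails. The sketch below is our own illustration, not an example from the text: X takes the values -1, 0, 1 with equal probability and Y = X², so that Y is a function of X and yet the covariance vanishes.

```python
from fractions import Fraction as F

# X takes the values -1, 0, 1 with probability 1/3 each, and Y = X**2.
joint = {(x, x * x): F(1, 3) for x in (-1, 0, 1)}

def expect(g):
    """Expected value of g(X, Y) under the joint distribution."""
    return sum(g(x, y) * pr for (x, y), pr in joint.items())

m10 = expect(lambda x, y: x)
m01 = expect(lambda x, y: y)
cov = expect(lambda x, y: (x - m10) * (y - m01))

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_x0 = sum(pr for (x, y), pr in joint.items() if x == 0)
p_y0 = sum(pr for (x, y), pr in joint.items() if y == 0)
p_x0_y0 = joint.get((0, 0), F(0))

print(cov, p_x0_y0, p_x0 * p_y0)
```

The covariance is zero while the joint probability differs from the product of the marginal probabilities, so the variables are dependent.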

In addition to the moments thus far defined there is another 
group of moments. They are called conditional moments. This 
term is used for moments of one of the variables X, Y, as- 
suming that the remaining variable has a certain definite 
value. 

In our further considerations we shall use two conditional
moments: the expected conditional value and the conditional
variance. If variable (X,Y) is continuous these parameters
are expressed by the respective formulae

$$E(Y \mid X = x) = m_{01}(x) = \frac{\int_{-\infty}^{\infty} y\, f(x,y)\, dy}{\int_{-\infty}^{\infty} f(x,y)\, dy} = \int_{-\infty}^{\infty} y\, f(y \mid x)\, dy, \qquad (17)$$

$$V(Y \mid X = x) = \frac{\int_{-\infty}^{\infty} [y - m_{01}(x)]^2 f(x,y)\, dy}{\int_{-\infty}^{\infty} f(x,y)\, dy} = \int_{-\infty}^{\infty} [y - m_{01}(x)]^2 f(y \mid x)\, dy. \qquad (18)$$

These are the conditional moments of variable Y. An analogous
pair of formulae can be given for variable X.
When variables X and Y are independent, then

$$E(Y \mid X = x) = \int_{-\infty}^{\infty} y\, f_2(y)\, dy = E(Y),$$

where $f_2(y)$ denotes the marginal density of variable Y.
1.2.5. Regression I

In a two-dimensional distribution of variable (X,Y) the
expected conditional value E(Y|X = x) is a function of the
variable x. We may thus write:

$$E(Y \mid X = x) = g_1(x). \qquad (1)$$

Substituting the simple symbol $\hat{y}$ for the expression
E(Y|X = x) in the above formula, we get

$$\hat{y} = g_1(x). \qquad (2)$$

Equation (2) is known in mathematical statistics as the
equation of regression I of Y on X.

By interchanging letters x and y in formulae (1) and (2)
we get the regression equation of variable X on Y

$$\hat{x} = g_2(y). \qquad (3)$$

If variable (X, Y) is continuous, then the geometrical repre- 
sentations of functions (2) and (3) will be lines. These lines 
are called regression I lines. 

GRAPH 1.

In Graph 1, the regression line of Y on X is shown. The
ordinates of this curve represent the expected values of
variable Y when variable X = x. If the equation of this line is
known then to each value of variable X we can assign an
expected value of variable Y.

GRAPH 2. 




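We shall prove the following

THEOREM 1. The expected value of the sum of the squared
deviations of Y from the regression line is a minimum.
Proof. We are to prove that

$$E[Y - g(X)]^2 = \text{minimum} \quad \text{when} \quad g(x) = E(Y \mid X = x).$$

But

$$E[Y - g(X)]^2 = \int_{-\infty}^{\infty} f_1(x)\, dx \int_{-\infty}^{\infty} [y - g(x)]^2 f(y \mid x)\, dy.$$

As we know, $E(Y - u)^2$ = minimum when u = E(Y).
Then

$$E(Y - u)^2 = V(Y).$$

Hence, for $E[Y - g(X)]^2$ to have a minimum it is necessary
that

$$g(x) = E(Y \mid X = x) = m_{01}(x),$$

because then

$$\int_{-\infty}^{\infty} [y - E(Y \mid X = x)]^2 f(y \mid x)\, dy$$

equals V(Y|X = x), i.e. is a minimum. Therefore

$$E[Y - g(X)]^2 = \int_{-\infty}^{\infty} f_1(x)\, dx \int_{-\infty}^{\infty} [y - m_{01}(x)]^2 f(y \mid x)\, dy = \text{minimum}.$$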
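
The minimum property can be illustrated on a discrete distribution, where the integrals become sums. The sketch below uses a hypothetical joint table of our own and compares the expected squared deviation for the conditional means with that for a deliberately shifted function g.

```python
from fractions import Fraction as F

# Hypothetical discrete distribution (illustrative numbers only).
xs = [0, 1, 2]
ys = [1, 3]
p = [[F(1, 10), F(1, 10)],
     [F(2, 10), F(2, 10)],
     [F(1, 10), F(3, 10)]]

def conditional_mean(i):
    """E(Y | X = x_i): the ordinate of the regression I line at x_i."""
    p_i = sum(p[i])
    return sum(y * p[i][j] for j, y in enumerate(ys)) / p_i

def expected_squared_deviation(g):
    """E[Y - g(X)]^2 for a function g given as a list of values g[i] = g(x_i)."""
    return sum((y - g[i]) ** 2 * p[i][j]
               for i in range(len(xs)) for j, y in enumerate(ys))

regression_1 = [conditional_mean(i) for i in range(len(xs))]
best = expected_squared_deviation(regression_1)

# Any other choice of g, here a shift of one ordinate by 1/2, gives a larger value.
shifted = [regression_1[0], regression_1[1] + F(1, 2), regression_1[2]]
worse = expected_squared_deviation(shifted)

print(regression_1, best, worse)
```

Only one alternative g is tried here; the theorem itself covers every possible choice.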



1.2.6. Regression II

In practice the shape of function g(x) is rarely known. 
The usual procedure is to take a sample from a two-dimensional 
population and to draw a scatter diagram. The points on 
the graph follow a more or less distinct trend. This trend 
provides certain information; on its basis the hypothesis
may be formulated that function g(x) belongs to a certain 
class of functions (e.g. to the class of linear, exponential or 
power functions or to the class of polynomials). 

Graph 3 presents a scatter diagram drawn on the basis 
of statistical material collected in connection with studies on 
the relationship between hop consumption and the produc- 
tion of wort. The statistics were obtained from the Piast 
Brewery in Wroclaw. 

The trend of the points on the graph is so distinct that we 
can safely postulate the hypothesis that g(x) belongs to the 
class of linear functions. 

In order to determine the parameters of function g(x)
the expression $E[Y - g(X)]^2$ has to be made a minimum.

GRAPH 3. Horizontal axis: wort (thousand hl/month).

If $g(x) = \alpha x$ then to find the value of parameter $\alpha$ we
have to minimize the value of the expression
$S = E[Y - \alpha X]^2$. We must calculate the derivative
$dS/d\alpha$ and equate it to zero. From the equation thus
obtained $\alpha$ can be determined.
The line determined in this way is called a regression II line.
If the hypothesis concerning the class of functions to which
g(x) belongs is true then the regression II line coincides with
the regression I line. In applications we are always interested
in regression I lines. Since the equations of these lines are
usually not known we substitute regression II lines for
regression I lines because the former are easier to determine.
While doing this we are seldom free from worry as to whether 
the regression line has been properly determined because 
information provided by a scatter diagram is scanty and so 
the hypothesis regarding the class of functions to which g(x) 
belongs may easily turn out to be wrong. 
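
For the one-parameter case g(x) = ax discussed above, the minimization can be carried out explicitly. The following sketch assumes that the expected value is replaced by an average over a small hypothetical sample (the data are invented); setting the derivative of S to zero then gives a = Σxy / Σx².

```python
from fractions import Fraction as F

# Hypothetical sample; the sample average stands in for the expected value E.
xs = [1, 2, 3, 4]
ys = [2, 3, 7, 8]

def S(a):
    """Sample counterpart of S = E[Y - aX]^2."""
    return sum((y - a * x) ** 2 for x, y in zip(xs, ys)) / F(len(xs))

def dS_da(a):
    """Derivative of S with respect to a."""
    return sum(-2 * x * (y - a * x) for x, y in zip(xs, ys)) / F(len(xs))

# Solving dS/da = 0 gives a = sum(x*y) / sum(x*x).
a_hat = F(sum(x * y for x, y in zip(xs, ys)), sum(x * x for x in xs))

print(a_hat, dS_da(a_hat), S(a_hat))
```

Whether a line through the origin is the right class of functions is exactly the kind of hypothesis that, as the text stresses, must be supported by information from outside statistics.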

It happens sometimes that apart from a scatter diagram 
we may have at our disposal some additional information 
providing a basis for the hypothesis concerning the class of 
functions to which g(x) belongs. For instance, sometimes
we know the equation of asymptotes of the regression line,
or we know that this line passes through the origin, or that it
does not intersect the positive part of the x-axis and the
negative part of the y-axis. Such information is very valuable.
It always comes from sources outside statistics. One of the 
conditions of the effectiveness of statistical analysis is a thor- 
ough knowledge of the subject being studied and of the division 
of science which is concerned with it. This means, for instance, 
that to analyse by statistical methods the effectiveness of 
penicillin in fighting tuberculosis we need phthisiologists, and 
to study the effect of the price of butter on the consumption 
of edible oils we need economists. The knowledge of statistics 
alone is not sufficient. Only the combination of statistical and 
non-statistical information can make our analysis fruitful. 

This principle is fully applicable to the determination of 
regression lines. 

1.2.7. Linear regression

Definition 1. If an equation of regression is expressed by
the formula

$$\hat{y} = \alpha_{21}\, x + \beta_{20}, \qquad (1)$$

we say that the regression of Y on X is linear.

Formula (1) is a regression equation of Y on X.

Quantities $\alpha_{21}$ and $\beta_{20}$ are certain constants called
regression parameters. The indices next to the parameters serve
to distinguish the regression parameters of Y on X from those
of X on Y. The first index is for the dependent variable in the
regression equation and the second for the independent.

The linear regression equation of X on Y is expressed by
the formula

$$\hat{x} = \alpha_{12}\, y + \beta_{10}. \qquad (2)$$

Sometimes we shall write equation (1) omitting indices, i.e.

$$y = \alpha x + \beta. \qquad (3)$$

In such cases it should be understood that our considerations 
refer to both types of regression lines, i.e. the regression of 
Y on X and the regression of X on Y. 

If we know that in the distribution of the two-dimensional
random variable (X,Y) the regression lines are straight, then
in order to determine the value of parameters $\alpha$ and $\beta$ we
have to find the minimum for the expression

$$E[Y - \alpha X - \beta]^2 = \int_{R_2} [y - \alpha x - \beta]^2\, dP,$$

where $R_2$ denotes the two-dimensional integration space and
dP the differential of the two-dimensional distribution.

We calculate the partial derivatives of the expression on the
left side with respect to $\alpha$ and $\beta$.
We have

$$\frac{\partial}{\partial \alpha} E(Y - \alpha X - \beta)^2 = -2E[(Y - \alpha X - \beta)X]$$

and

$$\frac{\partial}{\partial \beta} E(Y - \alpha X - \beta)^2 = -2E(Y - \alpha X - \beta).$$

By equating both these derivatives to zero we obtain a set of
normal equations

$$E[(Y - \alpha X - \beta)X] = 0, \qquad E(Y - \alpha X - \beta) = 0. \qquad (4)$$

After replacing the expected values by appropriate moments
this set of equations can be written in the following form:

$$m_{11} - \alpha\, m_{20} - \beta\, m_{10} = 0, \qquad m_{01} - \alpha\, m_{10} - \beta = 0. \qquad (5)$$

From the solution of the set of equations (5) we get:

$$\beta = \beta_{20} = m_{01} - \alpha_{21}\, m_{10}, \qquad (6)$$

$$\alpha = \alpha_{21} = \frac{m_{11} - m_{01}\, m_{10}}{m_{20} - m_{10}^2}. \qquad (7)$$

On the basis of 1.2.4. (14), (15) and (16), formula (7) may be
written thus:

$$\alpha = \alpha_{21} = \frac{\mu_{11}}{\mu_{20}}. \qquad (8)$$

Similarly we get

$$\beta_{10} = m_{10} - \alpha_{12}\, m_{01} \qquad (9)$$

and

$$\alpha_{12} = \frac{\mu_{11}}{\mu_{02}}. \qquad (10)$$

Parameters $\alpha_{21}$ and $\alpha_{12}$ are called regression coefficients.
Substituting (6) in (1) and (9) in (2) we obtain the regression
equations in the following form:

$$\hat{y} = \alpha_{21}(x - m_{10}) + m_{01}, \qquad (11)$$

$$\hat{x} = \alpha_{12}(y - m_{01}) + m_{10}. \qquad (12)$$

It follows from the above equations that both regression
lines pass through the point with coordinates $(m_{10}, m_{01})$. We
shall call this point the population centre of gravity.
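
The formulae for the regression parameters can be tried out on numbers. The sketch below treats a small hypothetical sample as an empirical distribution (each observation with probability 1/n, which is our assumption for the illustration) and computes both pairs of regression parameters from the moments.

```python
from fractions import Fraction as F

# Hypothetical sample treated as an empirical distribution (probability 1/n per point).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

def m(l, k):
    """Ordinary moment m_lk of the empirical distribution."""
    return F(sum(x ** l * y ** k for x, y in zip(xs, ys)), n)

m10, m01 = m(1, 0), m(0, 1)
mu11 = m(1, 1) - m10 * m01      # 1.2.4., formula (14)
mu20 = m(2, 0) - m10 ** 2       # 1.2.4., formula (15)
mu02 = m(0, 2) - m01 ** 2       # 1.2.4., formula (16)

alpha21 = mu11 / mu20           # regression coefficient of Y on X
beta20 = m01 - alpha21 * m10
alpha12 = mu11 / mu02           # regression coefficient of X on Y
beta10 = m10 - alpha12 * m01

print(alpha21, beta20, alpha12, beta10)
```

Evaluating either line at the mean of its argument returns the mean of the other variable, which is the statement that both lines pass through the centre of gravity.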

The knowledge of regression equations allows us to express 
the stochastic dependence between random variables by a 
mathematical function describing numerically the relationship 
existing between these variables. The derivation of a function 
formula is very convenient since it allows us to assign to 
each value of a random variable, appearing as an argument, 
an appropriate value of the other variable appearing as a 
function.

We have stated on page 5 that the significance of regression
line equations consists in the fact that they enable us to
estimate particular values of one variable on the basis of the
values assumed by the other variable. This estimate may be
better or worse, more accurate or less accurate. When we use
the word "estimate" we must also introduce the notion of the
"accuracy of the estimate" and create a measure of this
accuracy.

Of course, the smaller the sum of errors¹ that we commit by
replacing the real values of the random variable by the values
obtained from the regression line equation, the better the
estimate will be.

This statement lends itself to geometrical interpretation. 
The more closely the points are grouped around the regression 
line on the scatter diagram, or, what amounts to the same 
thing, the smaller is the dispersion of the points around the 
line, the better the estimate will be. 

1 We are not interested here in how these errors are measured: by
absolute values, squared value, or in some other way.

Let us call the quantity defined by formula

$$e_{y_i} = y_i - \hat{y}_i,$$

which is a realization of the random variable $e_y = Y - \hat{Y}$, the
i-th residual of the regression of Y on X, or in brief a residual.
As a measure of dispersion of points around the regression
line the residual variance $V(e_y)$ is generally used. It is
determined by the formula

$$V(e_y) = E(e_y^2) = E(Y - \hat{Y})^2. \qquad (13)$$

If the regression line is a straight line, then

$$V(e_y) = E[Y - (\alpha_{21} X + \beta_{20})]^2 = E[Y - m_{01} + \alpha_{21} m_{10} + \beta_{20} - \alpha_{21} X - \beta_{20}]^2 = E[(Y - m_{01}) - \alpha_{21}(X - m_{10})]^2 = E(W - \alpha_{21} U)^2,$$

where $W = Y - m_{01}$ and $U = X - m_{10}$.
And further

$$V(e_y) = E(W^2) - 2\alpha_{21} E(WU) + \alpha_{21}^2 E(U^2) = \mu_{02} - 2\alpha_{21}\mu_{11} + \alpha_{21}^2 \mu_{20}.$$

On the basis of (8) we have

$$V(e_y) = \mu_{02} - \alpha_{21}\mu_{11} = \mu_{02} - \alpha_{21}^2 \mu_{20}. \qquad (14)$$

Analogously it can be shown that

$$V(e_x) = \mu_{20} - \alpha_{12}^2 \mu_{02}. \qquad (15)$$

The square root of the residual variance we shall call the
standard error of the estimate and we shall denote it by the
symbols $\sigma_{21}$ and $\sigma_{12}$ respectively for the regression of Y
on X and the regression of X on Y. In this case

$$\sigma_{21} = \sqrt{V(e_y)} \qquad (16)$$

and

$$\sigma_{12} = \sqrt{V(e_x)}. \qquad (17)$$

Using the symbols for variance and covariance we may present
formula (14) in the following way:

$$V(e_y) = V(Y) - \alpha_{21} C(XY) = V(Y) - \alpha_{21}^2 V(X). \qquad (18)$$

Hence, it follows that

$$V(e_y) \le V(Y),$$

or, what amounts to the same thing, that

$$\frac{V(e_y)}{V(Y)} \le 1. \qquad (19)$$
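
The identity between the directly computed residual variance and its expression through variances and covariance can be checked on numbers. The sketch below again assumes a hypothetical sample treated as an empirical distribution; the helper `mean` and all data are our own.

```python
from fractions import Fraction as F

# Hypothetical sample treated as an empirical distribution.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def mean(values):
    """Exact arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values, F(0)) / len(values)

m10, m01 = mean(xs), mean(ys)
var_x = mean((x - m10) ** 2 for x in xs)                    # V(X) = mu_20
var_y = mean((y - m01) ** 2 for y in ys)                    # V(Y) = mu_02
cov = mean((x - m10) * (y - m01) for x, y in zip(xs, ys))   # C(XY) = mu_11

alpha21 = cov / var_x
beta20 = m01 - alpha21 * m10

# Residuals e_y = y - (alpha21 * x + beta20) and their variance computed directly.
residuals = [y - (alpha21 * x + beta20) for x, y in zip(xs, ys)]
residual_variance = mean(e ** 2 for e in residuals)

print(residual_variance, var_y - alpha21 * cov, var_y - alpha21 ** 2 * var_x)
```

The three printed values coincide, and their ratio to V(Y) lies between zero and one.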

In the conclusion of our comments on the two-dimensional
regression line let us quote two important theorems and their
proofs.

THEOREM 1. If a regression I line is a straight line then
the regression II line coincides with the regression I line.
Proof. By assumption we have

$$E(Y \mid X = x) = \alpha x + \beta.$$

Taking the mathematical expectation of both sides of the
equation we obtain:

$$E[E(Y \mid X)] = \alpha E(X) + \beta.$$

But

$$E[E(Y \mid X)] = E(Y).$$

Hence

$$m_{01} = \alpha\, m_{10} + \beta.$$
This means that the regression I line passes through the centre 
of gravity. We have shown on page 31 that the regression 
II line also passes through this point. This enables us to in- 
troduce new variables: 

$$U = X - m_{10}, \qquad W = Y - m_{01}.$$

Consequently the equation for regression I will assume the
following form:

$$E(W \mid U = u) = \alpha u,$$

and the equation for regression II will be expressed by the
formula

$$w = \alpha' u,$$

where $\alpha'$ denotes the regression coefficient in this equation.
Therefore

$$E(W - \alpha' U)^2 = E[(W - \alpha U) + (\alpha U - \alpha' U)]^2 = E(W - \alpha U)^2 + 2E\{(W - \alpha U)[(\alpha - \alpha')U]\} + E(\alpha U - \alpha' U)^2.$$

The expression

$$E\{(W - \alpha U)[(\alpha - \alpha')U]\}$$

is zero for each determined value of U since it follows from
the assumption of linearity that

$$E(W - \alpha U \mid U = u) = 0.$$

Thus we have

$$E(W - \alpha' U)^2 = E(W - \alpha U)^2 + E(\alpha U - \alpha' U)^2.$$

Since the first term on the right side of the equation does
not depend on $\alpha'$, then the expression $E(W - \alpha' U)^2$ has a
minimum for the same value of $\alpha'$ as the expression
$E(\alpha U - \alpha' U)^2$, and that expression has a minimum when

$$E(\alpha U - \alpha' U)^2 = 0, \quad \text{i.e. when } \alpha = \alpha'.$$

But $\alpha'$ is determined by the condition

$$E(Y - \alpha' X - \beta')^2 = \text{minimum},$$

which is equivalent to the condition

$$E(W - \alpha' U)^2 = \text{minimum},$$

and this condition is fulfilled when $\alpha = \alpha'$. Therefore, when
the regression I line is straight it coincides with the
regression II line.

THEOREM 2. When the regression I line is a straight line 
then the residual variance is a minimum. 

Proof. The parameters of the regression II line are deter- 
mined by the condition (see p. 30): 

$$V(e_y) = E(Y - \alpha' X - \beta')^2 = \text{minimum}.$$

It follows from Theorem 1 that both the regression I and 
regression II lines coincide in the case of linear regression. 
This means that for the regression I line this condition is 
also fulfilled, which proves that the theorem is correct. The 
theorem proved is a special case of theorem 1 on page 26. 

1.2.8. Correlation. Correlation ratio and correlation 
coefficient 

On page 32 in formula (13) the definition was given for
the residual variance as a measure of the dispersion of points
(x,y) around the regression line.

However, the application of the residual variance is not
limited to measuring only the dispersion of points around 
the regression line. Let us note that the smaller the dispersion 
the closer is the bond between random variables X and Y. 
When all the points lie on the regression line there is no dis- 
persion at all and V(e) = 0. This is a case of a functional 
relationship between variables X and Y and not of a stochastic 
relationship. 

It follows that quantity V(e) may be used for measuring 
the dependence between two random variables. Indeed the 
residual variance is used for this purpose although not in 
the form described in the definition. 

A measure of dependence between two random variables 
should meet the following requirements: 

1) it should have no dimension; 

2) it should be normalized and assume values belonging 
to a certain finite numerical interval; 

3) it should assume increasing values when the dependence 
becomes stronger, and decreasing values when it be- 
comes weaker; 

4) it should not depend on whether the dependence of X 
on Y is measured, or vice versa. 

None of these requirements is satisfied by V(e). With the 
help of several simple mathematical operations, however, 
we can construct a quantity which will fully satisfy these 
requirements. In order to satisfy requirements 1) and 2) 
it is sufficient to divide $V(e_y)$ by V(Y) or $V(e_x)$ by V(X).
Indeed, since $V(e_y)$ and V(Y) are of the same dimension, then

$$\frac{V(e_y)}{V(Y)}$$

has no dimension.

It follows from formula (19) on p. 33 that

$$0 \le \frac{V(e_y)}{V(Y)} \le 1,$$

which means that this quantity is normalized in the
interval [0,1].

Requirement 3) will be met if instead of $V(e_y)/V(Y)$ and
$V(e_x)/V(X)$ we introduce the quantities

$$\eta_y^2 = 1 - \frac{V(e_y)}{V(Y)}, \qquad (1)$$

$$\eta_x^2 = 1 - \frac{V(e_x)}{V(X)}. \qquad (2)$$

Quantities $\eta_y$ and $\eta_x$ defined by formulae (1) and (2) are
known as correlation ratios¹.
Of course

$$0 \le \eta_y \le 1. \qquad (3)$$

The correlation ratio $\eta_y$ equals one if and only if $V(e_y) = 0$,
i.e. when the dependence between random variables X and Y
is of a functional type.
When

$$\eta_y > 0$$

it is said that the random variables are correlated with one
another. When

$$\eta_y = 0$$

we say that they are uncorrelated. All that has been said about
correlation ratio $\eta_y$ also applies to $\eta_x$.

The lack of a correlation between random variables does 
not mean, by any means, that these variables are sto- 
chastically independent. As we know (see p. 21, formula 
(13)) random variables X and Y are independent if 

$$F(x,y) = F_1(x) \cdot F_2(y), \qquad (4)$$

and they are uncorrelated if

$$\eta_y = 0 \quad \text{or} \quad \eta_x = 0. \qquad (5)$$

Relations (4) and (5) are not equivalent. 



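1 This measure was introduced by K. Pearson.

In order to satisfy requirement 4) we assume that both
regression I lines are straight lines. Therefore

$$V(e_y) = V(Y) - \alpha_{21}^2 V(X)$$

(see p. 33, formula (18)), and analogously

$$V(e_x) = V(X) - \alpha_{12}^2 V(Y).$$

Hence

$$\eta_y^2 = 1 - \frac{V(e_y)}{V(Y)} = \alpha_{21}^2 \frac{V(X)}{V(Y)} = \frac{C^2(XY)}{V(X)\, V(Y)}$$

(see (12), (13) on p. 23).
Similarly

$$\eta_x^2 = 1 - \frac{V(e_x)}{V(X)} = \alpha_{12}^2 \frac{V(Y)}{V(X)} = \frac{C^2(XY)}{V(X)\, V(Y)}.$$

Therefore, when the regression lines are straight lines

$$\eta_y = \eta_x.$$

The quantity

$$\rho = \sqrt{\alpha_{12} \cdot \alpha_{21}} \qquad (6)$$

is called the correlation coefficient. Since the correlation
coefficient is a special case of the correlation ratio, then
on the basis of (1) we also have

$$\rho^2 \le 1 \quad \text{or} \quad -1 \le \rho \le 1.$$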

When $\rho > 0$ we say that between random variables X and Y
there is positive correlation. In cases of positive correlation an
increase in the value of one variable is accompanied by an
increase in the expected value of the other. Let us note that

$$\rho = \frac{\mu_{11}}{\sqrt{\mu_{20}\, \mu_{02}}}.$$

By this convention the sign of the correlation coefficient
depends on the sign of $\mu_{11}$.
On the other hand,

$$\alpha_{12} = \frac{\mu_{11}}{\mu_{02}}, \qquad \alpha_{21} = \frac{\mu_{11}}{\mu_{20}}.$$

Since $\mu_{20} > 0$ and $\mu_{02} > 0$, the signs of $\alpha_{12}$ and $\alpha_{21}$ also
depend on the sign of $\mu_{11}$. Hence, when $\rho > 0$, $\alpha_{12} > 0$ and
$\alpha_{21} > 0$. But $\alpha_{12}$ and $\alpha_{21}$ are slopes of the regression
lines. Therefore, when $\rho > 0$ the regression line
$\hat{y} = \alpha_{21} x + \beta_{20}$ forms an acute angle with the x-axis,
which means that $\hat{y}$ is an increasing function of x.

When $\rho < 0$ we deal with negative correlation and then an
increase in the value of one variable is accompanied by a
decrease in the conditional expected value of the other.
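
The relation between the two regression coefficients and the correlation coefficient can be illustrated numerically. The sketch below assumes a hypothetical sample treated as an empirical distribution; it takes the magnitude of the correlation coefficient from formula (6) and its sign from the covariance, and compares the result with the familiar ratio of the covariance to the product of the standard deviations.

```python
import math
from fractions import Fraction as F

# Hypothetical sample treated as an empirical distribution.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

m10, m01 = F(sum(xs), n), F(sum(ys), n)
var_x = sum((x - m10) ** 2 for x in xs) / n
var_y = sum((y - m01) ** 2 for y in ys) / n
cov = sum((x - m10) * (y - m01) for x, y in zip(xs, ys)) / n

alpha21 = cov / var_x   # slope of the regression of Y on X
alpha12 = cov / var_y   # slope of the regression of X on Y

# Formula (6) gives the magnitude; the sign is that of the covariance.
rho_squared = alpha12 * alpha21
rho = math.copysign(math.sqrt(rho_squared), cov)

# The same value from covariance and variances.
rho_direct = float(cov) / math.sqrt(var_x * var_y)

print(rho_squared, rho, rho_direct)
```

Taking the sign from the covariance is the convention described in the text; the square root in formula (6) alone would not distinguish positive from negative correlation.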

The position of the regression lines for a positive corre- 
lation is shown on Graph 1 and the position of these lines for 
a negative correlation is shown on Graph 2. 



GRAPH 1. 



GRAPH 2. 





The correlation coefficient equals +1 or -1 only when
all points (x,y) lie on the straight line.
Let us now prove
THEOREM 1. When $\rho^2 = 1$ the two regression lines coincide.
Proof. Let us assume that $\rho = +1$ (the proof is analogous
when we assume that $\rho = -1$). In this case it follows from the
definition of the correlation coefficient that

$$\rho^2 = \alpha_{12} \cdot \alpha_{21} = 1.$$

The parameter $\alpha_{12}$ is the tangent of the angle that the
regression line $\hat{x} = \alpha_{12}\, y + \beta_{10}$ forms with the y-axis.
The slope of this line, with reference to the horizontal axis,
equals $1/\alpha_{12}$. The inclination parameter of the regression
line $\hat{y} = \alpha_{21}\, x + \beta_{20}$ with reference to the horizontal
axis equals $\alpha_{21}$.

The tangent of the angle $\varphi$ contained between two regression
lines then is

$$\tan \varphi = \frac{\dfrac{1}{\alpha_{12}} - \alpha_{21}}{1 + \dfrac{\alpha_{21}}{\alpha_{12}}} = \frac{1 - \alpha_{12}\, \alpha_{21}}{\alpha_{12} + \alpha_{21}} = 0.$$

Since the tangent of an angle equals zero when the angle
equals zero, the result obtained indicates that the lines
coincide.

When the correlation coefficient equals zero, the random 
variables X and Y are uncorrelated. 

It can easily be shown (the proof is the same as for
Theorem 1) that when $\rho = 0$ then the angle between the
regression lines equals $\pi/2$. Let us note further that the
correlation coefficient equals zero only when C(XY) = 0.
However, since

$$\alpha_{21} = \frac{C(XY)}{V(X)}, \qquad \alpha_{12} = \frac{C(XY)}{V(Y)},$$

then if $\rho^2 = 0$, $\alpha_{12} = 0$ and $\alpha_{21} = 0$. Thus, when variables
X and Y are not correlated the regression lines intersect at the
angle $\pi/2$ and the regression line of Y on X is parallel to the
horizontal axis and the regression line of X on Y is parallel
to the vertical axis (Graph 3).

On page 23 we proved the theorem that if random variables 
X and Y are independent, the covariance C(XY) equals zero. 
We shall now express this theorem in a slightly different form. 

THEOREM 2. If variables X and Y are independent, then they 
are also uncorrelated. 

The converse theorem is not true. The theorem contra- 
positive to Theorem 2 is of great practical and theoretical 
importance. It is: 

THEOREM 3. If random variables X and Y are correlated they 
are also dependent. 

GRAPH 3. 



Theorem 3 does not have to be proved since it is a theorem 
contrapositive to Theorem 2, which is true. As we know, two 
contrapositive theorems must be either both false, or both true. 

To conclude our discussion of the correlation coefficient 
we shall prove the important 

THEOREM 4. For any random variables X and Y there is
always a linear transformation which can bring these variables
to the form in which the correlation coefficient $\rho$
between the transformed variables equals zero.

Proof. Using known formulae for translating and rotating
the coordinate system we have

\[ X' = (X - m_{10})\cos\theta + (Y - m_{01})\sin\theta, \]
\[ Y' = -(X - m_{10})\sin\theta + (Y - m_{01})\cos\theta. \]

For variables X' and Y' to be uncorrelated it is necessary and
sufficient that $C(X'Y') = E(X'Y') = 0$.

In accordance with formula (13) item 1.2.4.:

\[ E(X'Y') = E\{[(X - m_{10})\cos\theta + (Y - m_{01})\sin\theta]\,[-(X - m_{10})\sin\theta + (Y - m_{01})\cos\theta]\}. \]

After multiplying and bringing the sign of the expected value
within the brackets we get

\[ E(X'Y') = E[(X - m_{10})(Y - m_{01})]\cos 2\theta - \tfrac{1}{2}\,[E(X - m_{10})^2 - E(Y - m_{01})^2]\sin 2\theta. \]

Divide both sides of the above equation by $\cos 2\theta$:

\[ \frac{E(X'Y')}{\cos 2\theta} = E[(X - m_{10})(Y - m_{01})] - \tfrac{1}{2}\,[E(X - m_{10})^2 - E(Y - m_{01})^2]\tan 2\theta = \mu_{11} - \tfrac{1}{2}(\mu_{20} - \mu_{02})\tan 2\theta. \]

If we choose the rotation angle so that

\[ \tan 2\theta = \frac{2\mu_{11}}{\mu_{20} - \mu_{02}}, \tag{7} \]

we get $E(X'Y') = 0$, and hence $\rho(X'Y') = 0$.
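
The construction used in the proof can be tried on a small sample. The sketch below is illustrative only: the function names and the data are hypothetical, sample moments stand in for the theoretical ones, and the angle is computed with `atan2` so that the case of equal variances causes no division by zero.

```python
import math

def central_moments(xs, ys):
    """Return (m10, m01, mu20, mu02, mu11) computed from a sample."""
    n = len(xs)
    m10, m01 = sum(xs) / n, sum(ys) / n
    mu20 = sum((x - m10) ** 2 for x in xs) / n
    mu02 = sum((y - m01) ** 2 for y in ys) / n
    mu11 = sum((x - m10) * (y - m01) for x, y in zip(xs, ys)) / n
    return m10, m01, mu20, mu02, mu11

def decorrelate(xs, ys):
    """Translate to the centre of gravity and rotate by the angle of formula (7)."""
    m10, m01, mu20, mu02, mu11 = central_moments(xs, ys)
    # tan(2*theta) = 2*mu11 / (mu20 - mu02)
    theta = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    c, s = math.cos(theta), math.sin(theta)
    xr = [(x - m10) * c + (y - m01) * s for x, y in zip(xs, ys)]
    yr = [-(x - m10) * s + (y - m01) * c for x, y in zip(xs, ys)]
    return xr, yr, theta

# Hypothetical sample.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]
xr, yr, theta = decorrelate(xs, ys)
# The rotated variables have zero means, so their covariance is the mean product.
cov_rotated = sum(a * b for a, b in zip(xr, yr)) / len(xr)
```

The covariance of the rotated coordinates should vanish up to rounding error, which is the content of Theorem 4 for this sample.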

Let us note in passing that if we determine angle $\theta$ from
formula (7) and if

\[ \lambda = \tan\theta, \]

then the line having the equation

\[ y - m_{01} = \lambda\,(x - m_{10}) \tag{8} \]



has the property that the sum of the squares of the distances 
of points (x,y) from this line is a minimum. The straight line 
(8) is known as the orthogonal regression line. It follows from 
formula (8) that this line passes through the centre of gravity. 
After some elementary transformations we obtain a formula 
for the slope of line (8) 
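
A sketch of these elementary transformations, assuming $\mu_{11} \neq 0$ and writing $\lambda = \tan\theta$ for the slope of the orthogonal regression line. The choice of root is an added remark rather than something stated in the text: of the two roots of the quadratic, the one with the same sign as $\mu_{11}$ corresponds to the minimum.

```latex
\tan 2\theta = \frac{2\lambda}{1-\lambda^{2}} = \frac{2\mu_{11}}{\mu_{20}-\mu_{02}}
\;\Longrightarrow\;
\mu_{11}\lambda^{2} + (\mu_{20}-\mu_{02})\,\lambda - \mu_{11} = 0,
\qquad
\lambda = \frac{\mu_{02}-\mu_{20} + \sqrt{(\mu_{20}-\mu_{02})^{2} + 4\mu_{11}^{2}}}{2\mu_{11}} .
```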



1.2.9. The two-dimensional normal distribution

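The distribution of the two-dimensional random variable
(X,Y) with a density determined by the formula

\[ \varphi(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-m_1)^2}{\sigma_1^2} - \frac{2\rho\,(x-m_1)(y-m_2)}{\sigma_1\sigma_2} + \frac{(y-m_2)^2}{\sigma_2^2} \right] \right\}, \tag{1} \]

where $m_1 = E(X)$, $m_2 = E(Y)$, $\sigma_1^2 = V(X)$, $\sigma_2^2 = V(Y)$ and
$\rho$ is the correlation coefficient between X and Y,
is called a two-dimensional normal distribution.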

The great practical importance of distribution (1) follows 
from the generalized Central Limit Theorem on two-dimen- 
sional variables. On Graph 1 the density surface for a two-di- 
mensional normal distribution is presented. 

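THEOREM 1. If the density of random variable (X,Y) is
expressed by formula (1) and if the correlation coefficient $\rho$
between variables X and Y equals zero, then variables X and
Y are independent.
Proof. If $\rho = 0$ then

\[ \varphi(x,y) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left\{ -\frac{1}{2} \left[ \frac{(x-m_1)^2}{\sigma_1^2} + \frac{(y-m_2)^2}{\sigma_2^2} \right] \right\}. \]

Denoting

\[ \varphi_1(x) = \frac{1}{\sigma_1\sqrt{2\pi}} \exp\left\{ -\frac{(x-m_1)^2}{2\sigma_1^2} \right\}, \tag{2} \]

\[ \varphi_2(y) = \frac{1}{\sigma_2\sqrt{2\pi}} \exp\left\{ -\frac{(y-m_2)^2}{2\sigma_2^2} \right\}, \tag{3} \]

we have

\[ \varphi(x,y) = \varphi_1(x)\,\varphi_2(y). \]

After integrating both sides of the above identity with
respect to each variable, we get

\[ \int_{-\infty}^{x}\int_{-\infty}^{y} \varphi(u,v)\,du\,dv = \int_{-\infty}^{x} \varphi_1(u)\,du \int_{-\infty}^{y} \varphi_2(v)\,dv. \]

Using the definition of the two-dimensional random variable
distribution function we may write

\[ F(x,y) = F_1(x)\,F_2(y). \]

We have obtained the necessary and sufficient condition for
the independence of the random variables (item 1.2.3.,
formula (13)). Thus the theorem has been proved.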

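THEOREM 2. The conditional density $\varphi(y \mid x)$ in a
two-dimensional normal distribution is the density of a normal
distribution with the following parameters:

\[ E(Y \mid X = x) = m_2 + \alpha_{21}\,(x - m_1), \tag{4} \]

\[ V(Y \mid X = x) = (1 - \rho^2)\,V(Y). \tag{5} \]

Proof. On the basis of formula (11), item 1.2.3.,

\[ \varphi(y \mid x) = \frac{\varphi(x,y)}{\varphi_1(x)}, \]

where $\varphi_1(x)$ is the marginal density of X. For convenience let us denote

\[ \alpha_{21} = \rho\,\frac{\sigma_2}{\sigma_1}. \]

Then

\[ \varphi(y \mid x) = \frac{1}{\sigma_2\sqrt{2\pi}\,\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2} \left[ \frac{y - m_2 - \alpha_{21}\,(x - m_1)}{\sigma_2\sqrt{1-\rho^2}} \right]^2 \right\}. \]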
Formula (4) represents the regression equation of Y on X.
Since a regression I line is a locus of conditional expected
values, it follows from the theorem proved above that:

Corollary 1. Regression lines in a two-dimensional normal
distribution are straight lines.

On the right-hand side of the equation in formula (5) only
constant quantities appear. Hence follows

Corollary 2. In a two-dimensional normal distribution, the
conditional variance $V(Y \mid X = x)$ is a constant quantity.

Both corollaries, as we shall see, are of great practical importance.
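
Formulae (4) and (5) translate directly into a small computation. The sketch below is illustrative: the function name and the numerical parameters are hypothetical, and the slope is written as rho * sigma2 / sigma1, which follows from the slope C(XY)/V(X) of the regression line of Y on X.

```python
def conditional_parameters(m1, m2, sigma1, sigma2, rho, x):
    """Mean and variance of Y given X = x in a two-dimensional normal distribution."""
    alpha21 = rho * sigma2 / sigma1          # slope of the regression line of Y on X
    mean = m2 + alpha21 * (x - m1)           # formula (4)
    variance = (1.0 - rho ** 2) * sigma2 ** 2  # formula (5), with V(Y) = sigma2**2
    return mean, variance

# Hypothetical parameters: the conditional mean moves along a straight line,
# while the conditional variance is the same for every x.
mean_a, var_a = conditional_parameters(0.0, 0.0, 1.0, 2.0, 0.5, 1.0)
mean_b, var_b = conditional_parameters(0.0, 0.0, 1.0, 2.0, 0.5, 3.0)
```

The two calls differ only in x; the means differ by the slope times the change in x, and the variances agree, as Corollaries 1 and 2 state.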

The density surface of a normal distribution shown on 
Graph 1 is a geometrical interpretation of equation (1). The 
intersection of this surface with the plane parallel to plane 
XOY is called the equiprobability curve. Such curves in a 
normal distribution are ellipses. The equation of the family 
of these ellipses is as follows: 

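\[ \frac{1}{1-\rho^2} \left[ \frac{(x-m_1)^2}{\sigma_1^2} - \frac{2\rho\,(x-m_1)(y-m_2)}{\sigma_1\sigma_2} + \frac{(y-m_2)^2}{\sigma_2^2} \right] = C^2, \tag{6} \]

where C is a variable parameter dependent upon the parameter
of the intersecting plane. The centre of this family of ellipses
is the centre of gravity, i.e. the point with coordinates $(m_1, m_2)$.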
The regression lines are the diameters of the ellipses conjugate 
to the diameters parallel to the coordinate axes. 

GRAPH 2. [Equiprobability ellipse with the regression line of X on Y,
the regression line of Y on X and the orthogonal regression line.]

The major axis of the ellipse coincides with the orthogonal
regression line (Graph 2).
When $\rho = 0$, equation (6) assumes the form

\[ \frac{(x-m_1)^2}{\sigma_1^2} + \frac{(y-m_2)^2}{\sigma_2^2} = C^2. \tag{7} \]

The major and minor axes of ellipse (7) are parallel to
the corresponding coordinate axes, X and Y. When $\rho = 0$ and
$\sigma_1 = \sigma_2$, the ellipses become circles.

Since the sum of the random variables

\[ \frac{(X-m_1)^2}{\sigma_1^2} + \frac{(Y-m_2)^2}{\sigma_2^2} \]

has the $\chi^2$ distribution with two degrees of freedom, the
probability that a random point (x,y) is located within the area
determined by curve (7) is

\[ P = 1 - e^{-C^2/2}. \]

The probability that point (x,y) is located within the area
determined by curve (6) is the same.
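
A Monte Carlo check of this statement can be sketched as follows. It rests on assumptions that should be read as such: the ellipse is taken as the set where the quadratic form of the density exponent, divided by 1 - rho^2, does not exceed c^2, and the probability is compared with 1 - exp(-c^2/2), the chi-square value for two degrees of freedom. Function names, sample size and seed are arbitrary.

```python
import math
import random

def fraction_inside_ellipse(c, rho, m1=0.0, m2=0.0, s1=1.0, s2=1.0,
                            n=100_000, seed=1):
    """Estimate the probability that (X, Y) falls inside the ellipse with parameter c."""
    rng = random.Random(seed)
    root = math.sqrt(1.0 - rho * rho)
    inside = 0
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Build a correlated normal pair from two independent standard normals.
        x = m1 + s1 * z1
        y = m2 + s2 * (rho * z1 + root * z2)
        u, v = (x - m1) / s1, (y - m2) / s2
        q = (u * u - 2.0 * rho * u * v + v * v) / (1.0 - rho * rho)
        if q <= c * c:
            inside += 1
    return inside / n

def theoretical_probability(c):
    """Chi-square (2 degrees of freedom) probability of the region q <= c**2."""
    return 1.0 - math.exp(-c * c / 2.0)

estimate_correlated = fraction_inside_ellipse(1.5, 0.7)
estimate_uncorrelated = fraction_inside_ellipse(1.5, 0.0)
```

Both estimates should lie close to the same theoretical value, whatever the correlation, which illustrates why the probability attached to curve (6) equals that attached to curve (7).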

The area determined by the equiprobability ellipse may be 
considered as a characteristic of the dispersion in a two-dimen- 
sional normal distribution. The measure of this area is, then, 
a generalized measure of dispersion comprising two-dimen- 
sional random variables. 

The values of the distribution function and of the density 
function of a two-dimensional normal distribution are given 
in tables (see [40] and [44]). These tables, however, are not in 
general use, and therefore are not easily accessible. For this 
reason, the following expansion of the density function of a 
normal distribution into a series (see [7], [52], [54]) has great 
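practical importance. It can be proved that when $m_1 = m_2 = 0$,
then

\[ \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{x^2}{\sigma_1^2} - \frac{2\rho\,xy}{\sigma_1\sigma_2} + \frac{y^2}{\sigma_2^2} \right] \right\} = \frac{1}{\sigma_1\sigma_2} \sum_{\nu=0}^{\infty} \frac{\rho^{\nu}}{\nu!}\, \Phi^{(\nu+1)}\!\left(\frac{x}{\sigma_1}\right) \Phi^{(\nu+1)}\!\left(\frac{y}{\sigma_2}\right), \tag{8} \]

where

\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du. \]

Hence, when x = y = 0,

\[ \frac{1}{2\pi\sqrt{1-\rho^2}} = \sum_{\nu=0}^{\infty} \frac{\rho^{\nu}}{\nu!}\, [\Phi^{(\nu+1)}(0)]^2. \]

Integrating both sides with respect to $\rho$, we get

\[ \sum_{\nu=0}^{\infty} \frac{\rho^{\nu+1}}{(\nu+1)!}\, [\Phi^{(\nu+1)}(0)]^2 = \frac{1}{2\pi} \int_{0}^{\rho} \frac{dt}{\sqrt{1-t^2}} = \frac{1}{2\pi} \arcsin\rho. \]

On the other hand, after integrating formula (8) we get

\[ \int_{-\infty}^{x}\int_{-\infty}^{y} \varphi(u,v)\,du\,dv = \sum_{\nu=0}^{\infty} \frac{\rho^{\nu}}{\nu!}\, \Phi^{(\nu)}\!\left(\frac{x}{\sigma_1}\right) \Phi^{(\nu)}\!\left(\frac{y}{\sigma_2}\right). \]

Of course, if x = y = 0, then

\[ \int_{-\infty}^{0}\int_{-\infty}^{0} \varphi(u,v)\,du\,dv = \sum_{\nu=0}^{\infty} \frac{\rho^{\nu}}{\nu!}\, [\Phi^{(\nu)}(0)]^2. \]

Since, however, $\Phi(0) = \tfrac{1}{2}$, we finally have

\[ \int_{-\infty}^{0}\int_{-\infty}^{0} \varphi(u,v)\,du\,dv = \frac{1}{4} + \frac{1}{2\pi} \arcsin\rho \tag{9} \]

(see [7]).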
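
Formula (9) can be checked by simulation. The sketch below is illustrative only: it draws pairs from a two-dimensional normal distribution with zero means and unit variances, counts the share of points with both coordinates non-positive, and compares it with 1/4 + arcsin(rho)/(2*pi). The function names, sample size and seed are arbitrary choices.

```python
import math
import random

def quadrant_probability_formula(rho):
    """Right-hand side of formula (9)."""
    return 0.25 + math.asin(rho) / (2.0 * math.pi)

def quadrant_probability_mc(rho, n=200_000, seed=2):
    """Monte Carlo estimate of P(X <= 0, Y <= 0) for a standardized normal pair."""
    rng = random.Random(seed)
    root = math.sqrt(1.0 - rho * rho)
    hits = 0
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x = z1
        y = rho * z1 + root * z2   # correlation between x and y equals rho
        if x <= 0.0 and y <= 0.0:
            hits += 1
    return hits / n

estimate_positive = quadrant_probability_mc(0.6)
estimate_negative = quadrant_probability_mc(-0.5)
```

For rho = 0 the formula gives exactly 1/4, the product of the two marginal probabilities, in agreement with the independence result of Theorem 1.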
1.3. Non-linear regression in $R_2$

In 1.2.5. we have defined the regression I line (formula (1),
p. 25). To determine the regression equation of Y
on X we used the notation

\[ \hat{y} = g_1(x). \]

The concrete form of the regression curve equation is usually
not known. In order to know the equation of the regression
curve it is necessary to know the distribution of variable
(X,Y), and this is seldom possible in practice.

If the distribution of variable (X,Y) is not known, various 
procedures may be used. We shall discuss the more important 
of them. 

1. The hypothesis concerning the shape of curve $g_1(x)$ based on the
hypothesis concerning the distribution of variable (X,Y)

If the collection of values that can be assumed by the random 
variable (X, Y) is so large that it cannot be analysed as a whole, 
then the only statistical source of information about the 
distribution of variable (X,Y) is a sample. All other informa- 
tion about the distribution of variables (X,Y) is non-statistical 
information (see p. 29). In analysing the distribution of a 
random variable we may, and should, take into consideration 
all the information in our possession, both statistical and 
non-statistical. Suppose that on the basis of available information
we have postulated a statistical hypothesis H, according
to which the distribution of random variable (X,Y) is normal.
Let us also suppose that the testing¹ of this hypothesis
has not provided a basis for rejecting it. In this case we may
assume that function $g_1(x)$ is linear, because if the hypothesis
is true, then, according to Corollary 1 (1.2.9.), the regression
lines are straight lines.

¹ It will be discussed in Chapter 4.

2. The hypothesis concerning the shape of curve $g_1(x)$ based on
non-statistical information

A hypothesis concerning the shape of the regression curve 
can often be based on information not directly related to the 
distribution of random variable (X,Y), but pertaining to the
phenomenon that this variable describes mathematically. For 
instance, in smoothing out a broken curve of a time series 
representing the growth of population of a country within a 
certain period of time, there are reasons to assume that such 
a curve will be exponential if only we make a generally accept- 
able assumption that the rate of population increase during 
the period under consideration was not subject to serious 
fluctuations. 

3. The hypothesis concerning the shape of curve $g_1(x)$ based on a scatter
diagram

If non-statistical information is so scanty that we cannot
postulate any hypothesis concerning the shape of the curve $g_1(x)$,
then the only way out is to take a sample, to draw a scatter 
diagram and to analyse it. If the points on the diagram are so 
distributed that a distinct tendency in the form of a clearly 
marked trend is noticeable, then on the basis of the shape of 
this trend a certain class of functions can be chosen that 
should be suitable for approximating the distribution of 
points on the scatter diagram. The parameters of a function 
approximating the distribution and belonging to this class 
are determined by an appropriate method (e.g. the method of 
least squares). 

What is striking in this type of approach is the high degree 
of arbitrariness. This approach cannot be taken when there 
are more than 3 dimensions, since in such cases it is not pos- 
sible to draw a scatter diagram. For this reason non-linear 
regression is much less often used than linear regression. 
It is worth stressing at this point that non-linear regression
can always be approximated by a broken curve and thus
non-linear regression can be reduced to linear regression.

In many important applications, non-linear regression can
be reduced to a linear form by an appropriate choice and
introduction of new variables¹.
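
One common instance of such a change of variables is the exponential curve y = a * exp(b * x): taking logarithms gives ln y = ln a + b * x, which is linear in x. The sketch below illustrates the idea under that assumption only; the function names and the data are hypothetical, and the method of least squares is applied to the transformed variable.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by fitting a straight line to (x, ln y)."""
    slope, intercept = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept), slope   # (a, b)

# Hypothetical data lying exactly on y = 2 * exp(0.3 * x).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.3 * x) for x in xs]
a, b = fit_exponential(xs, ys)
```

Note that least squares on ln y minimizes squared deviations of the logarithms, not of y itself, so with noisy data the result generally differs from a direct non-linear fit.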

For these reasons in recent studies on probability and 
mathematical statistics non-linear regression is either com- 
pletely omitted (e.g. [7], [21]) or discussed only briefly and 
in general terms (e.g. [28], [33]). 



¹ This will be discussed in Chapter 5.



2. THE APPLICATION OF REGRESSION AND 
CORRELATION TO ECONOMIC RESEARCH 

2.1. On the relation between economics and mathematics, 
on statistics and econometrics 

The constant changes to which production processes are 
subject as a result of the rapid development of science and 
technique, pose more and more difficult problems to economic 
science. The historical and descriptive methods tradi- 
tionally used by the social sciences are no longer adequate 
to solve these problems. The ability to predict is becoming 
an indispensable tool in the proper management of the pro- 
duction processes. The need for acquiring this ability is felt 
both in a capitalist economy which continually tries to free 
itself from the vicissitudes of market domination, and in a 
socialist economy where the growth targets are determined 
on the basis of long-run economic plans. 

The ordinary meaning of the word "predict" is obvious. 
The term "scientific forecasting", however, requires some 
explanation. By scientific forecasting we understand in this 
context every judgment, the accuracy of which is a random 
event with a probability known to the degree of exactness suf- 
ficient for practical purposes. It follows from this definition 
that a scientific forecast is always a statistical hypothesis. 
Scientific forecasting is impossible without comparing dif- 
ferent quantities, without measuring, without using numbers. 
For this reason contemporary economics makes use of
mathematics¹ to a growing extent; especially useful are the
theory of probability, mathematical statistics and econometrics.

¹ An exhaustive survey of applications of mathematics to economics
can be found in studies [2], [58].

The knowledge of causal relations existing between a given 
phenomenon and other phenomena is indispensable for scien- 
tific forecasting. If the relation between them is very strong 
it can be presented as a mathematical function. In the so- 
cial sciences, as a rule, we do not deal with functional re- 
lationships. This is due to the complexity of the nature of the 
phenomena studied. The relationships between them are usually 
of a stochastic nature (see p. 2). A specialized branch of 
mathematical statistics dealing with stochastic relationships 
is the theory of regression and correlation (which we shall 
also call correlation analysis). 

So far, correlation analysis has not been very extensively used 
by economists in their normal work. It is not difficult to 
explain why this is so. If a research method is to find a wide 
range of applications it is necessary for it to be: 1) sufficiently 
universal, i.e. suitable for solving a large number of different 
problems; 2) not too difficult, and easily popularized. We 
shall discuss the second of these conditions in Chapter 3. 
Here we shall try to show that the first condition is satisfied: 
that economics provides many interesting problems which 
can be solved only by correlation methods. 

Since the times of A. Cournot, political economy has em- 
ployed functions in its research with ever greater frequency 
and daring. Cournot himself, being a mathematician, believed 
that economics, like mechanics, can freely use the concept 
of a function without the necessity of concerning itself unduly 
with the exact form of the function. Every function can be 
accepted as given, on the assumption that such a function 
actually exists in real life and, one way or another, can al- 
ways be determined when the need arises. This point of view 
has been accepted by other economists such as Gossen, Pareto, 
Marshall and Keynes, to name just a few. This standpoint 
is acceptable if it is remembered that the curves used in
mathematical economics are, in fact, regression curves when 
looked upon from a statistical point of view because the 
variables employed in economics are not ordinary, but random 
variables. This is the reason for a great number of applications 
of correlation analysis to economics. We shall not confine 
ourselves to this general justification of the suitability of 
correlation methods for economic research, but shall also 
discuss the more important fields of their application. 

Before we do so, however, let us say a few words about 
econometrics. Among the problems with which this science 
deals is the elaboration of numerical research methods for 
use in economics. The first place among them is occupied by 
statistical methods. Here is what Tintner has to say, in his 
book "Econometrics": "Econometricians also make use of 
statistical methods to test certain hypotheses about the
unknown population. This procedure is useful in the testing 
and verification of economic laws" (see [57], p. 18). The 
above sentence does not define the subject of econometrics 
but pertinently indicates with what this science deals. And 
here is how the founders of the Econometric Society interpret 
its scientific tasks: 

Econometric Society is an international association whose 
objective is the development of economic theory in conjunc- 
tion with statistics and mathematics (see [61], p. 5). 

In the majority of works on econometrics we find examples 
of the application of statistical methods to economics. Cor- 
relation analysis, and particularly regression theory are the 
most useful. Mathematical economics treats the economic 
categories with which it deals, as mathematical variables, and 
analyses functional relationships between these variables. 
When a dependence between a pair of variables is considered, 
it may be presented as a curve. The curves with which mathe- 
matical economics deals may be divided into three groups. 
To the first group belong the curves describing the relationship
between a pair of economic variables, e.g. the relationship
between demand and price, costs and production, consump- 
tion and income, etc. 

The second group consists of the curves describing the 
relationship between an economic variable and time. They are 
called time curves. And finally, to the third group belong the 
curves describing the relationship between an economic variable 
and a technological variable. We deal with this type of relation- 
ship when we study the effect of the quality of raw materials 
on the cost of production or on the productivity of labour. 
The third group of curves we shall call techno-economic 
curves. 

We shall discuss here the most important curves belonging 
to the first group: want curves, personal income distribution 
curves, demand curves and cost curves. Curves belonging to 
the second group will be considered together in the section 
entitled "Time Curves". Curves belonging to the third group 
will also be discussed together in the section called "Techno- 
Economic Curves". 



2.2. More important applications of regression and correlation 
theory in economic research 

2.2.1. Want curves 

In order to live, man has to satisfy his various wants. Speak- 
ing most generally, these wants may be divided into physical 
and spiritual wants. To satisfy his wants man has to develop 
appropriate kinds of activities. One of the forms of such activ- 
ity is labour. Labour creates use values, i.e. provides the 
objects of nature with the ability to satisfy human needs. 
In a society where the division of labour is well developed, 
people receive objects necessary to satisfy their needs on the
market in exchange for money. The more money they have, 
the better they can satisfy their needs. The desire to satisfy 
one's needs manifests itself externally as the pursuit of money. 

The feeling of displeasure that a man sometimes experiences 
when his needs are not satisfied is the driving power inducing 
him to develop an activity leading to the satisfaction of his 
needs. The lack of such a feeling is tantamount to the lack of 
wants. 

Human wants are, as a rule, unlimited. However, the means
that people have at their disposal for the satisfaction of their
needs are limited. This results in a continuous conflict: a
decision must be made as to which of the wants is to be
satisfied.

Let S be the sum of financial resources which person Z
has at his disposal to meet his needs during period T. Let
Q_1, Q_2, ... be various needs of this person, and S_1, S_2, ... the
amounts required to meet needs Q_1, Q_2, ... Amounts S_1, S_2, ... we
shall call the prices of needs Q_1, Q_2, ...
The following inequality is obviously satisfied:

\[ S < \sum_{i} S_i \]

and person Z can satisfy only some of his needs. The choice
must be made so that

\[ \sum_{i=1}^{k} S_i \le S, \]

where k denotes the number of needs satisfied.

In choosing those needs that are to be satisfied from the 
amount S, person Z, if he behaves rationally, must act so as to 
minimize the displeasure caused by the inability to satisfy all 
his needs. 

Among the most important human wants are those which 
are necessary to his existence. These wants we shall call basic 
wants. If the sum S that person Z has at his disposal is small,
then it will be spent entirely on his basic wants. Until they
are satisfied person Z has no freedom in choosing the wants
to be satisfied. Let us denote by S' the minimum amount of
money necessary to keep Z alive during period T under
conditions that are not detrimental to the existence and
development of his body. In this case, assuming that Z behaves
rationally, sum S will be spent entirely on the satisfaction of his
basic needs unless S is greater than S'.

Suppose that at a certain moment t_0 person Z has at his
disposal the sum S = S', and that at moments t_0 + rT (r =
1, 2, ...) he will receive amounts S_r, increasing with r. Then,
at the moment t_r = t_0 + rT, person Z will have at his disposal
the amount S_r = S' + S"_r. The sum S"_r is the amount that
Z has at moment t_r after meeting his basic needs during
period T. The sum S" we shall call the sum of free decision.

Let us consider what behaviour should be expected from 
people when the sums of free decision that are at their dis- 
posal increase. Two types of situations may develop here: 

1) The whole sum S" will be earmarked to satisfy always
the same, permanent group of needs.

2) As S" grows, new needs will be satisfied. The group of
satisfied needs will change and expand with the growth
of S".

We know from experience that the first type of situation
rarely occurs in practice. In conformity with the psychological
law¹ the displeasure related to an unsatisfied need decreases
as the need is satisfied. This means that as personal incomes
increase people spend money from the sums of free decision at
their disposal to satisfy new and different needs and do not
always spend these sums for the satisfaction of the same needs.

¹ Called Gossen's First Law.

In accordance with experience it should be assumed that
as personal income increases the number of new and differ- 
ent needs that people can satisfy also increases. This means 
that the number of wants felt depends on the sum of free 
decision. This may be presented on a graph. 

On the horizontal axis we have the sum of free decision
S" which is at the disposal of a single person. The vertical
axis represents the sums S_i which, on the average, are earmarked
by the person who has at his disposal the sum of free decision
S", to satisfy need Q_i.

The bisector of the angle between the coordinate axes is 
the locus of points whose ordinates represent the maximum 
amount of expenditure for the satisfaction of needs at a given 
value of S". 

GRAPH 1. 




As S" increases, new needs or groups of needs are satisfied.
The sums spent on the satisfaction of needs Q_i are increasing
functions of S":

S_i = f_i(S").

The functions are represented on the graph by curves. As S"
increases and the need is satisfied, the sums S_i asymptotically
approach straight lines parallel to the horizontal axis. The
distance between these lines measured on the vertical axis
represents the maximum amounts of sum S_i that are earmarked,
on the average, by a single person for the satisfaction
of certain needs. This means that if commodity C_1 satisfies
need Q_1 then the total amount of this commodity that can be
absorbed by the consumers, when the price of commodity
C_1 is given and equals p_1, cannot exceed a certain constant
value which depends upon the number of persons who possess
sums of free decision S" enabling them to satisfy need Q_1.

If the sum of free decision S" = S"_0, then the ordinate of
the point located on the bisector and determined by S"_0
represents the maximum amount that can be spent on the
satisfaction of needs by a person whose sum of free decision
S" = S"_0 (the ordinate in question is represented on the graph
by segment MR). The points of intersection of the ordinate
with particular S_i curves determine the amounts of expenditures
incurred on the average by the person possessing
the sum of free decision S" = S"_0 for the satisfaction of his
needs.

Particular curves S_i, which we shall call want satisfaction
curves, or want curves, run one above the other, according to
the priority of wants, i.e. according to their intensity. Wants
that we find it difficult not to satisfy are represented by curves
located low on the graph. As S" increases new needs appear;
they are represented by curves located higher up.

The segment determined on the MR ordinate by two adjacent
curves S_i and S_{i+1} reflects the average expenditure
incurred for the satisfaction of need Q_{i+1} by a person for
whom S" = S"_0.

Segment MN represents the average total sum of expenditures
for the satisfaction of the needs of a person for whom
S" = S"_0.

Segment NR = MR - MN represents the average amount
saved during a certain period of time by a person for whom
S" = S"_0.

The knowledge of curves S_i is of great practical importance.
These curves enable us to determine the expected amount
of expenditures incurred by person Z for the satisfaction
of particular needs depending upon the sums of free decision
at his disposal. They also enable us to determine the amount
saved by person Z. This means that knowing the distribution
of personal income and curves S_i we know the character of
social demand for goods and services.

From the statistical point of view, curves S_i are regression
curves reflecting the relationship between the expenditures
for the satisfaction of particular needs and the sums of free
decision.

The determination of regression line parameters is the 
task of statistics. These parameters are determined on the basis 
of statistical material which should be as complete as possible, 
collected in the course of making statistical observations. 

The analysis of the relationship between the amount of 
maintenance expenditures and the size of income was started 
by Engel. Analysing family budgets, he noticed that the per- 
centage share of maintenance expenditures decreases with an 
increase in income. This relationship is known in the literature 
as the Engel-Schwabe Law (see [61], p. 47). Analysing the
structure of family budgets, Engel derived an equation for
the curves determining the stochastic relationship between
the sum of maintenance expenditures and the size of income.

GRAPH 2.

These curves are known as Engel curves. It is particu- 
larly worth noting that Engel curves may be approximated 
by straight lines with an accuracy sufficient for practical pur- 
poses. Engel curves shown on Graph 2 were determined 
by Allen and Bowley [3] on the basis of German statistical 
material for the years 1927-28. 

As can easily be seen, we have here a case of a very clear 
linear correlation. The study of Engel curves has shown 
that they can be approximated by linear regression lines within 
their interval of validity¹.
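
The linear approximation of an Engel curve, together with the restriction to its interval of validity, can be sketched as follows. Everything numerical here is hypothetical: the incomes and expenditures are invented for illustration and are not the Allen and Bowley data, and the function names are arbitrary.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept of expenditure on income."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def predict_within_validity(x, xs, slope, intercept):
    """Evaluate the regression line only inside the range of observed incomes."""
    low, high = min(xs), max(xs)
    if not low <= x <= high:
        raise ValueError("income lies outside the interval of validity")
    return intercept + slope * x

# Hypothetical family-budget data (income, maintenance expenditure).
incomes = [100.0, 200.0, 300.0, 400.0]
expenditures = [70.0, 120.0, 170.0, 220.0]
slope, intercept = fit_line(incomes, expenditures)
shares = [e / i for e, i in zip(expenditures, incomes)]
```

With a positive intercept and a slope below one, the share of expenditure in income falls as income grows, which is the pattern described by the Engel-Schwabe Law.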

¹ The interval of validity of a regression line is a numerical interval
containing all measured values of the random variable of the sample
appearing in the equation of regression as an argument.

Engel curves are a special case of want satisfaction curves.
The term "want curves" was used by the author for the
first time in study [30]. It is worth noting that since Engel's
time a lot of attention has been devoted to the analysis of
family budgets. On the basis of information derived from
these budgets, research is being carried on concerning the
relationship between the expenditures for the satisfaction of
particular wants and the size of income. It has turned out
that the terminology connected with these studies is not
uniform. H. T. Davis in study [11] uses the term Engel curves
for want curves. W. Winkler in study [61] objects to the use
of this term and calls them Othmar Winkler's curves. Keynes
[34] calls them consumption curves. A similar confusion
developed with regard to the term used in the literature to
describe the situation when consumption growth is slower
than income growth. This economic phenomenon, besides
the term "Engel's Law", is also called a "declining
propensity to consume". If, after Keynes, we denote consumption
expenditures by C and income by Y we can express the
relationship between consumption and income in the form
of the following equation:

C = f(Y).


The derivative dC/dY (see [34], p. 114) Keynes calls the
marginal propensity to consume. On the same page of this 
work we read: 

"Our normal psychological law that, when the real income 
of the community increases or decreases, its consumption 
will increase or decrease but not so fast, can, therefore, be 
stated... in a formally complete fashion... ΔY_w > ΔC_w ..."

The above statement has been quoted here to show that, 
regardless of certain differences of a secondary nature, the 
main idea expressed by Engel's Law and the "psychological 
law" of Keynes is the same. It is interesting to note that a 
similar idea is expressed by Gossen's First Law: "The intensity 
of a given need steadily decreases as it is satisfied until the 
level of saturation is reached" (see [26], p. 4). 

Without becoming involved in a criticism of the conclu- 
sions derived by various economists from the laws quoted 
above we can say that all three laws express the following 
important economic truth, binding both in a capitalist and 
in a socialist economy: individual wants weaken as they are 
satisfied. 

To conclude our considerations concerning numerous applications
of want satisfaction curves we shall mention an interesting
study by Wald [60] devoted to the problem of determining
indifference surfaces. In his work Wald assumes that
when three commodities x, y and z are considered the
indifference function W(x,y,z) is a second order polynomial of the
type

\[ W(x,y,z) = a_{00} + a_{01}x + a_{02}y + a_{03}z + a_{11}x^2 + a_{12}xy + a_{13}xz + a_{22}y^2 + a_{23}yz + a_{33}z^2, \]

where all the coefficients $a_{01}, a_{02}, \ldots, a_{33}$
can be found if the equations for the Engel curves are known.
The constant term $a_{00}$, however, cannot be determined. This
constant term is not of great importance since it is necessary
to know the indifference surface equation only in order to
determine the equations of the indifference curves (isoquant
equations), and these do not depend on $a_{00}$.

It follows that want curves, regardless of their names, are 
of fundamental importance to all contemporary mathematical 
economics. A number of important concepts and methods 
used in mathematical economics could certainly be used in 
the political economy of socialism. For this to happen, how- 
ever, it is necessary to expand the studies of family budgets 
since the material they provide enables us to learn about the 
relationship between the amount of expenditure for the satis- 
faction of particular needs and the size of income. Regression 
lines are a statistical way of expressing this relationship (see
4.3., Example 1). 

2.2.2. Income distribution curves 

In the preceding section the relationship between the 
expenditures for the satisfaction of particular needs and 
the size of income has been described. This is undoubtedly 
one of the most important and interesting economic relation- 
ships. Naturally it does not appear in isolation but is causally 
related to other dependencies which form the whole, complex 
economic system. 

For the full utilization of information supplied by want 
curves it is necessary to know the distribution of income. By 
the distribution of income of the population we shall under- 
stand a statistical relationship between the size of income and 
the number of people in a given income bracket, or the relation- 
ship between the size of income and the frequency of its appear- 
ance. These formulations are actually equivalent to one 
another; the formal difference between them consists in the
fact that in one case we deal with frequencies and in the other
with relative frequencies. 

Below is an example of the distribution of income per 
person taken from the Statistical Yearbook for 1956 ([65], 
pp. 284-285). Table 1 shows the distribution of monthly 
earnings for September 1955. 

TABLE 1

EMPLOYMENT ACCORDING TO MONTHLY EARNINGS FOR
SEPTEMBER 1955

The number of employed persons in particular classes of gross earnings

Gross earnings class      Number of employed persons
(in zlotys)               in absolute figures (thousands)

up to 400                      181.9
401 to 600                     625.1
601 to 800                   1,017.8
801 to 1,000                   979.2
1,001 to 1,500               1,516.3
1,501 to 2,000                 565.9
2,001 to 2,500                 200.2
2,501 to 3,000                  76.4
over 3,000                      57.4



For each particular wage group there is a corresponding 
figure showing the number of employees whose earnings fall 
into this class. 

Table 2 also shows the distribution of monthly earnings 
in September 1955. It differs from Table 1 only in that rela- 
tive frequencies have been substituted for frequencies. 

A graph called a frequency histogram is a graphic presen- 
tation of the distribution. One condition that has to be ful- 
filled if a graph is to be drawn is that the class intervals of the 
frequency distribution be equal. Since the distribution shown 
in Table 1 has unequal class intervals then in order to draw 
a graph we have to calculate cumulative frequencies cor- 
responding to particular class intervals of the frequency 
distribution:

Gross earnings class      Cumulative number of employed
(in zlotys)               persons (thousands)

up to 400                      181.9
401 to 600                     807.0
601 to 800                   1,824.8
801 to 1,000                 2,804.0
1,001 to 1,500               4,320.3
1,501 to 2,000               4,886.2
2,001 to 2,500               5,086.4
2,501 to 3,000               5,162.8
over 3,000                   5,220.2



TABLE 2

EMPLOYMENT ACCORDING TO MONTHLY EARNINGS FOR
SEPTEMBER 1955

Gross earnings class      Relative frequencies    Cumulative frequencies
(in zlotys)               (per cent)              (per cent)

up to 400                       3.5                      3.5
401 to 600                     12.0                     15.5
601 to 800                     19.5                     35.0
801 to 1,000                   18.8                     53.8
1,001 to 1,500                 29.0                     82.8
1,501 to 2,000                 10.8                     93.6
2,001 to 2,500                  3.8                     97.4
2,501 to 3,000                  1.5                     98.9
over 3,000                      1.1                    100.0
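The passage from frequencies to relative and cumulative frequencies can be checked with a few lines of arithmetic. The sketch below takes the counts of Table 1 as given; the function and variable names are our own and serve only as an illustration.

```python
# Counts from Table 1 (thousands of employed persons), in class order.
CLASSES = [
    "up to 400", "401 to 600", "601 to 800", "801 to 1,000",
    "1,001 to 1,500", "1,501 to 2,000", "2,001 to 2,500",
    "2,501 to 3,000", "over 3,000",
]
COUNTS = [181.9, 625.1, 1017.8, 979.2, 1516.3, 565.9, 200.2, 76.4, 57.4]


def relative_frequencies(counts):
    """Return each class count as a percentage of the total count."""
    total = sum(counts)
    return [100.0 * c / total for c in counts]


def cumulative(values):
    """Return the running totals of a list of numbers."""
    running = 0.0
    totals = []
    for value in values:
        running += value
        totals.append(running)
    return totals


rel = relative_frequencies(COUNTS)   # relative frequencies, per cent
cum_counts = cumulative(COUNTS)      # cumulative counts, thousands
cum_rel = cumulative(rel)            # cumulative relative frequencies, per cent

for name, r, c in zip(CLASSES, rel, cum_rel):
    print(f"{name:>15}  {r:5.1f}  {c:6.1f}")
```

Rounded to one decimal place the relative frequencies agree with Table 2. Cumulating the unrounded shares gives about 53.7 for the fourth class rather than the printed 53.8, which suggests that the printed column cumulates the rounded figures.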



Below is a graph showing cumulative frequencies based 
on Table 2. 

GRAPH 1.

The cumulative frequency curve is shown on the graph as
a broken line. This is the result of dividing statistical mate- 
rial into classes. If the classification of statistical observa- 
tions and the preparation of frequency distributions could 
be avoided, the curve of cumulative frequencies would cer- 
tainly be different. It would give a true picture of earnings 
in the month of September 1955 and would be free from 
distortions caused by the classification of statistical data. 
Treating the income of the population as a continuous variable, 
we can assume that the cumulative frequency curve would 
also be continuous. Its shape could be guessed in different 
ways. The simplest way would be to construct a polygon of 
relative frequency and to smooth out by hand the broken 
line obtained in this way. This is a really good and simple 
method but it is used rather reluctantly because: 

1) it involves a certain degree of arbitrariness in drawing 
a curve; 

2) it does not provide an equation of the curve; 

3) it is not conducive to probability reasoning. 

For these reasons analytic methods are preferred for 
smoothing out curves. Although they require many cum- 
bersome computations they are free from the last two of 
the above-mentioned drawbacks. They also make it possible 
to determine the curve in the "best" way, i.e. with proper 
consideration given to a maximum or minimum condition 
such as that the sum of the squared deviations of the variable 
from the resultant curve be a minimum. 
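As an illustration of analytic smoothing, the sketch below fits a smooth curve to the cumulative frequencies of Table 2 by least squares. The choice of curve is our assumption and is not taken from the text: we suppose that the cumulative frequency F(v) has the log-logistic form 1 / (1 + exp(-(a + b ln v))), so that the logit of F is linear in ln v and an ordinary least-squares line can be fitted. Note that the squared deviations are minimized on the transformed (logit) scale, not on the original one.

```python
import math

# Upper class limits (zlotys) and cumulative shares from Table 2.
# The open class "over 3,000" has no upper limit and is left out.
UPPER_LIMITS = [400, 600, 800, 1000, 1500, 2000, 2500, 3000]
CUM_SHARE = [0.035, 0.155, 0.350, 0.538, 0.828, 0.936, 0.974, 0.989]


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Assumed transformation: logit of the cumulative share against log income.
log_income = [math.log(v) for v in UPPER_LIMITS]
logit_share = [math.log(p / (1.0 - p)) for p in CUM_SHARE]
a, b = least_squares_line(log_income, logit_share)


def smoothed_cdf(v):
    """Smoothed cumulative frequency at income v (assumed log-logistic form)."""
    return 1.0 / (1.0 + math.exp(-(a + b * math.log(v))))


for v, p in zip(UPPER_LIMITS, CUM_SHARE):
    print(f"{v:5d}  observed {p:.3f}  smoothed {smoothed_cdf(v):.3f}")
```

Unlike a curve drawn by hand, this procedure yields an equation of the curve and involves no arbitrariness once the form of the curve has been chosen; the choice of the form itself remains a hypothesis to be checked against the data.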

A smoothed out curve of cumulative frequency enables 
us to guess the shape of the frequency distribution curve. 
Indeed, a cumulative frequency distribution curve is nothing 
but an empirical distribution. Hence, using the known formula:

$$f(v) = \frac{dF(v)}{dv},$$

where F(v) denotes the cumulative distribution and f(v) the
frequency distribution, we can guess the shape of the
distribution curve with a fairly good approximation. On Graph 1
the distribution curve is presented as a broken line.
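For grouped data the derivative can only be approximated by a difference quotient. The sketch below divides the increase of the cumulative frequency over each class by the width of the class, using the figures of Table 2. Two assumptions are ours: the lowest class is taken to start at zero, and class boundaries are rounded to 400, 600, and so on. The open class "over 3,000" is omitted since it has no width.

```python
# Class boundaries in zlotys (lowest class assumed to start at 0) and
# cumulative frequencies in per cent at each upper boundary (Table 2).
BOUNDS = [0, 400, 600, 800, 1000, 1500, 2000, 2500, 3000]
CUM_PERCENT = [3.5, 15.5, 35.0, 53.8, 82.8, 93.6, 97.4, 98.9]


def density_from_cumulative(bounds, cum):
    """Approximate f(v) = dF/dv by (increase in F) / (class width)."""
    densities = []
    previous = 0.0
    for k, value in enumerate(cum):
        width = bounds[k + 1] - bounds[k]
        densities.append((value - previous) / width)
        previous = value
    return densities


densities = density_from_cumulative(BOUNDS, CUM_PERCENT)

for k, d in enumerate(densities):
    print(f"{BOUNDS[k]:5d} to {BOUNDS[k + 1]:5d}: {d:.4f} per cent per zloty")
```

Because the classes are of unequal width, it is these quotients, and not the relative frequencies themselves, that indicate the shape of the distribution curve. Under the stated assumptions the highest value falls in the class 601 to 800.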

An analysis of personal income, like an analysis of family 
budgets, should be based on current research data providing 
statistical material that can be used to determine particular 
want curves and the income distribution curve. 

Let us denote by V the income per person, and by W_i the
consumption per person of commodity A_i expressed in physical
units, where i is the number of a commodity. In this case
(V, W_i) is a two-dimensional random variable. Let f =
f(v, w_i, t) denote the empirical distribution of this variable.
This distribution changes in time, i.e. it is a function of time.
Let t_1, t_2, ..., t_r, ... be consecutive moments in time and let
t_r - t_{r-1} = Δt. For a sufficiently small interval Δt we can
assume that prices on the market are constant. If for every given
moment t_r supply S_i is greater than demand D_i and if the
distribution is known for all i's then we have enough
information to decide
how demand changes with price. In a planned economy this 
would provide sufficient information for solving the problem 
of how to fix the volume of production to ensure market 
equilibrium. Unfortunately in practice this is not possible. 
There are too many goods and services to make it feasible 
to carry on statistical research on each of them in order to 
determine the function f(v, w_i, t). The processing of statistical
material pertaining to income and expenditures per person 
and a comprehensive analysis of this material require a lot 
of time and work. We can learn about only those relation- 
ships and dependencies which are of greatest importance to 
the national economy. The knowledge of the shape of the 
income distribution curve and of the want satisfaction curves 
is of special importance in a planned economy. Let us discuss 
this matter in greater detail. Let us assume that we know 
the shape of the income distribution curve and the shape 
of the want satisfaction curve for A (for example, A could
be sugar). In this case for each size of income there are two
corresponding figures: the number of persons in this income 
bracket or the relative frequency with which this income ap- 
pears, and the average degree of satisfying need A, i.e. the 
average consumption of sugar by persons in this income 
bracket. On Graph 2 the relationship between the size of 
income and the average consumption of sugar per person 
is presented. Variable V is the argument and stands for the 
size of income; variable W denotes the average consumption 
of sugar per person. The dependence of W on V is described 
on the graph by the regression curve w = φ(v).

GRAPH 2. 




About a dozen points are distinctly marked on the curve. They 
are determined on the basis of statistical data. From these 
points perpendicular segments are drawn to the plane VOW. 
These segments represent the relative frequencies of the occur- 
rence of particular magnitudes of income per person. Graph 2 
is a three-dimensional graph. The relative frequencies shown on 
it are denoted by f(v) and measured along the vertical axis.
It can be clearly seen from Graph 2 that if we know the
shape of want satisfaction curve A and the shape of the income
distribution curve we can easily determine both the average
consumption per person of commodity A corresponding to a
particular income group, and the total consumption
of this commodity in particular groups. This enables the 
authorities to see to it by using an appropriate price 
and wage control apparatus that particular needs of 
the population are satisfied to a sufficiently high degree, 
with special consideration given to the protection of the 
interests of those in the lowest income groups. 

If we assume that people in different income groups do not
differ from each other with respect to the intensity of desire
to satisfy want A, i.e. that people whose income is v_k and
consumption is w_k would consume w_{k+1} if their incomes
increased to v_{k+1}, then it follows from Graph 2 that knowing
the income distribution curve and the want satisfaction curve
we can predict by how much the total consumption of commodity A
will increase when the earnings of population group k grow
from v_k to v_{k+1}. Denoting by Δ the total increase in the
consumption of commodity A, and by N the total population
we get the following equality:

$$\Delta = N \cdot f(v_k) \cdot (w_{k+1} - w_k)$$

(see Graph 2).
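A small numerical sketch may make this prediction concrete. It assumes that the equality takes the form Δ = N · f(v_k) · (w_{k+1} - w_k), i.e. that each of the N · f(v_k) persons in group k raises consumption from w_k to w_{k+1}. The population figure and the consumption figures below are hypothetical; only the share 0.195 is borrowed from Table 2 for illustration.

```python
def consumption_increase(population, share_k, w_k, w_next):
    """Total increase in consumption when group k moves to the next income level.

    population -- total population N
    share_k    -- relative frequency f(v_k) of income group k (as a fraction)
    w_k        -- average consumption per person at income v_k
    w_next     -- average consumption per person at income v_{k+1}
    """
    return population * share_k * (w_next - w_k)


# Hypothetical figures: 1,000,000 persons, 19.5 per cent of them in group k,
# average sugar consumption rising from 1.5 kg to 1.75 kg per person.
delta = consumption_increase(1_000_000, 0.195, 1.5, 1.75)
print(f"Increase in total consumption: {delta:,.0f} kg")
```

The result depends on the assumption stated in the text that people in different income groups do not differ in the intensity of their desire to satisfy the want in question.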

The knowledge of the want satisfaction curve and of the 
income distribution curve allows us to predict how the demand 
for a given commodity will change in consequence of changes 
in income. The latter are not the only cause of changes in 
demand. Besides income, price essentially affects the size of 
demand. In our considerations so far we have assumed that 
prices are constant. This was permissible since our analysis 
was limited to a sufficiently short period of time Δt. When
research on income, consumption, prices, demand and supply
is carried on continuously we can disregard those periods
of time Δt during which a change in prices occurs, just as we
do when analysing a function with a finite number of points of
discontinuity. If economic research is conducted continuously
then in every period Δt the incomes, average consumption and
prices are known, as is the demand.
This is enough for the purpose of the current management 
of the economy. However, it is not enough for planning, or 
for the scientific forecasting of the course of the economic 
processes in the future. It is well known that for an economy 
to be in equilibrium it is required that at a given price of 
commodity A its supply be equal to the demand for it. Sup- 
pose that during a certain period of time the following situa- 
tion prevails on the market: the supply of commodity A is 
small, the demand for it large and the price high. This price 
considerably exceeds the social cost of producing commodity A. 
This means that the production of this commodity is very profit- 
able. If the economy is based on the principle of profitability 
this situation is bound to provide a stimulus to increasing 
production. An increase in production cannot occur instan- 
taneously but requires a certain period of time. Thus the 
need for accurate scientific forecasting stems from the fact 
that the adjustment of supply to demand does not take place 
directly, as was the case in a primitive economy, but through
the market. As long as the market exists, so long will scientific
forecasting be needed, regardless of whether the market is in a
capitalist or a socialist economy.

If a forecast of changes in demand is to be accurate we 
must know the relationship between demand and price as 
well as the relationship between demand and size of income. 
The knowledge of the relationship between demand and 
price allows us to answer the question: what will the probable 
demand be at a given price and what will the probable price 
be at a given demand? An accurate answer to this question 
is of great importance both in planning production and in 
fixing prices. To provide a correct answer it is necessary to 
know the demand curve. We shall now discuss these curves.

2.2.3. Demand curves

As we know, there is a distinct interdependence between 
price and demand in a free market economy; when the price 
rises demand drops, when the price declines demand 
increases. In this case 

D = V(P), (1) 

where P stands for the price of a commodity and D for the 
amount that can be sold at this price. Naturally, a functional 
description of complex relationships that exist between par- 
ticular phenomena, or economic processes is always to some 
extent a scientific abstraction. In fact, the relationship between 
demand and price is not of a functional, but of a stochastic 
nature. This means that it is possible to express this relation- 
ship in mathematical language only by statistical methods. 
Functions that are used in mathematical economics are a tool 
of learning only when they can be statistically verified. All 
theoretical utterances about these functions are actually only 
scientific hypotheses until their correctness is checked by 
statistical methods. They become laws only after statistical 
verification. 

We have made these comments because in many textbooks 
on economics no justification is given for representing the 
relationship between demand and price by a concave and 
monotonically downward sloping curve (see Graph 1). 

GRAPH 1.

Very little can be said a priori about the shape of the
demand curve. All that we know is that it should fall as the 
price increases. The shape of the demand curve can be deter- 
mined only by statistical methods. The demand curve, from 
a statistical point of view, is the regression line of D on P. 
This means that in order to express the relationship between 
price and demand in a functional language it is not enough to 
write down a function inverse to function (1), but that we 
have to find the other regression line, i.e. the regression line 
of P on D.
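The difference between the two regression lines can be seen on a small numerical example. In the sketch below the price and demand observations are hypothetical; the point is only that the slope of the regression line of P on D is not the reciprocal of the slope of the regression line of D on P unless the relationship is exactly functional.

```python
def regression_slope(xs, ys):
    """Least-squares slope of the regression line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical observations of price and quantity demanded.
PRICES = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
DEMAND = [95.0, 80.0, 72.0, 55.0, 50.0, 33.0]

slope_d_on_p = regression_slope(PRICES, DEMAND)   # regression of D on P
slope_p_on_d = regression_slope(DEMAND, PRICES)   # regression of P on D
inverse_of_first = 1.0 / slope_d_on_p             # slope of the inverted line
r_squared = slope_d_on_p * slope_p_on_d           # squared correlation coefficient

print(f"slope of D on P:            {slope_d_on_p:.4f}")
print(f"slope of P on D:            {slope_p_on_d:.4f}")
print(f"reciprocal of D-on-P slope: {inverse_of_first:.4f}")
print(f"r squared:                  {r_squared:.4f}")
```

The product of the two slopes equals the square of the correlation coefficient, so the two lines coincide only when this coefficient equals plus or minus one.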

In a socialist economy in which the monetary commodity 
exchange system prevails, an analysis of demand not only 
does not lose its importance, but acquires new significance 
which is essentially different from the significance it has in 
a capitalist economy. The main purpose of analysing demand 
in a socialist economy is to learn about the needs of the society 
and to adapt the production apparatus to the best possible 
satisfaction of these needs. 

Let us denote by W the average consumption per person 
of commodity A, by V income per person, and by P the price 
of commodity A. Between the random variable W and the 
random variables V and P there is a stochastic dependence. 
A certain defined value w corresponds to each pair of numbers 
(v, p). Suppose that as a result of statistical analysis we have
obtained numerical material on the basis of which we have
constructed a three-dimensional model of relationships between
random variable W and random variables V and P (see Graph 2).
Along the V-axis the centres of class intervals of the income
distribution series are measured, and along the P-axis the prices
that have been observed during the study (segments on the V-axis
are not in the same units as those representing price intervals).
Along the W-axis the average consumption per person of
commodity A is measured. The segments perpendicular to
plane VOP drawn from the middle of each square represent
the volume of the average consumption per person of commodity A
for various incomes and prices. The model would
modity A for various incomes and prices. The model would 
not be complete if it did not take into account the distribu- 
tion of the random variable (VPW). Since in this model the values 
of the distribution function constitute a fourth variable we 
cannot draw the required number of coordinate axes in a 
three-dimensional space. We have overcome this difficulty 
by presenting on the graph the values of the empirical distri- 
bution function (i.e. frequencies) as squares of different areas 
located inside the squares of the chessboard. Let us note 
that by fixing the price and moving along the V-axis
we find the relationship between the amount of the average 
consumption per person of commodity A and the size of 
income. This relationship, expressed by a regression line, is 
the want satisfaction curve of commodity A. By fixing 
the size of income and moving along the P-axis we find the 
relationship between the consumption per person of com- 
modity A and the price. The regression curve describing this 
relationship is called the demand curve. If by D we denote 
the demand for commodity A, then 

D(v,p)=W(v,p).f(v,p).N, (2) 

where D(v,p) denotes the demand for A on the assumption 
that the price of this commodity is p and the income per person 
is v; similarly W(v,p) denotes the average consumption per 
person of commodity A if V = v and P = p, and f(v,p)
denotes the frequency of random variable (V,P) at the point
(V = v, P = p). N stands for the total population.
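Formula (2) is a simple product and can be evaluated cell by cell. In the sketch below the average consumption W(v,p), the frequency f(v,p) and the population N are hypothetical figures chosen only to show the arithmetic.

```python
def demand(avg_consumption, frequency, population):
    """Formula (2): D(v,p) = W(v,p) * f(v,p) * N."""
    return avg_consumption * frequency * population


# Hypothetical figures for one cell of the income-price grid:
# W(v,p) = 1.5 kg per person, f(v,p) = 0.125, N = 1,000,000 persons.
d = demand(1.5, 0.125, 1_000_000)
print(f"Demand in this cell: {d:,.0f} kg")
```

The same function applied to every pair (v, p) for which W and f have been estimated gives the whole demand surface.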

The above graph presents the distribution of the
three-dimensional random variable (VPW). Suppose that this
distribution is known. In this case, other things being equal,
we also know the dependence of the demand for commodity
A on both the size of income per person and the price of this
commodity. Thus we have enough information to be able to
adjust the supply to the demand for commodity A, i.e. to satisfy 
the economic equilibrium requirements. 

The knowledge of the distribution of the random variable 
(VPW) allows us to solve many important economic problems. 
Let us consider one of them. Suppose that the production
and import potential does not permit us to saturate the market
for commodity A to a degree sufficiently high to meet the
demand for it at the constant price P = p_0. This means that,
since the economic equilibrium requirements have been impaired,
in addition to the rigid price p_0 fixed by the government,
a new market equilibrium price p_1 (or rather a black
market price) will appear. This price will shift on to the
shoulders of the society the burden of maintaining speculators
and smugglers and will constitute a temptation for dishonest
employees of the socialized trade apparatus to hoard the scarce
commodity. To prevent such a development, extremely harmful
from an economic point of view, the State regulates the
price of commodity A by trying to fix the price at a given 
distribution of income per person in such a way as to 
ensure the sales of the stock of commodity A, and at the same 
time make revenues a maximum. To regulate prices without 
causing undesirable social and economic results it is necessary 
to decide: 

1) how to determine the equilibrium price of commodity A; 

2) to what extent the demand for this commodity will be
satisfied in population groups of different incomes per 
person. 

We know that if at price P = p_0 the demand exceeds the
supply, the equilibrium price p_1 will be higher than price p_0.
The higher price p_1 will cause a drop in demand, thus adjusting
the latter to the available supply. Of course, the drop in 
demand will occur because less well-to-do population groups 
will be forced to satisfy a smaller amount of their wants in 
consequence of the high price (too high a price of commod- 
ity A). Each forced renunciation of the satisfaction of social 
wants is an undesirable development from an economic point 
of view. However, it may sometimes be tolerated as a neces- 
sary evil if its social and economic consequences are not 
dangerous. They may be dangerous when basic needs are not 
satisfied. They should not be dangerous, however, when the 
restrictions affect the satisfaction of the needs for luxury 
goods. Naturally, if the interests of poorer people are to be 
protected, we have to know the process of satisfying wants 
in population groups of different income brackets, their 
respective purchasing powers and the market demand for 
commodity A. 

Answers to these questions are facilitated by Graph 3. 

In this graph we can see what the distribution of the amount
s of commodity A would be at the price P = p_j. The areas
of the bases of the parallelepipeds shown on the graph are
equal to f(v,p). The heights W(v,p) of these parallelepipeds
represent the average consumptions of commodity A, or, what
amounts to the same thing, the average individual satisfactions
of need A. Hence the volume of a parallelepiped, equal to the
product W(v_i,p_j) · f(v_i,p_j), multiplied by the population
number N, represents the demand for commodity A that exists in
the population group whose income is V = v_i, when price P = p_j
(see formula 2). The total demand for commodity A equals
the sum of the volumes of particular parallelepipeds multiplied
by N. This means

$$D(p_j) = N \sum_i W(v_i, p_j)\, f(v_i, p_j). \qquad (3)$$
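The summation over income groups, and the search for a price at which the available stock is sold, can be sketched as follows. All the figures are hypothetical. We also make a simplifying assumption that is not in the text: at each price the frequencies f(v_i, p_j) are taken to be the income shares of the groups, unchanged from price to price.

```python
# Hypothetical income groups (zlotys) with their shares of the population.
INCOME_SHARE = {500: 0.5, 1000: 0.25, 2000: 0.25}

# Hypothetical average consumption W(v, p) in kg per person,
# indexed first by income v and then by price p.
AVG_CONSUMPTION = {
    500:  {8: 1.0, 10: 0.8,  12: 0.5},
    1000: {8: 1.5, 10: 1.25, 12: 1.0},
    2000: {8: 2.0, 10: 1.75, 12: 1.5},
}
POPULATION = 1000
PRICES = [8, 10, 12]


def group_demand(income, price):
    """Demand of one income group at a given price: N * W * f."""
    return POPULATION * AVG_CONSUMPTION[income][price] * INCOME_SHARE[income]


def total_demand(price):
    """Total demand at a given price: the sum over all income groups."""
    return sum(group_demand(v, price) for v in INCOME_SHARE)


def clearing_price(supply):
    """Lowest observed price at which total demand does not exceed supply."""
    for price in sorted(PRICES):
        if total_demand(price) <= supply:
            return price
    return None  # supply is too small even at the highest observed price


price = clearing_price(1200)
allocation = {v: group_demand(v, price) for v in INCOME_SHARE}
print(f"price: {price}, total demand: {total_demand(price):.1f}")
print(f"distribution among income groups: {allocation}")
```

With a stock of 1,200 units the sketch selects the price 10 and shows how the quantity sold is distributed among the three income groups, which is the kind of information needed to judge how far the needs of the lowest income groups are met.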

It follows that, other things being equal, we can answer
the questions concerning the size of the demand for commodity A
when price P = p, the price of this commodity when the supply S
of this commodity equals s, and the distribution of the amount s
of commodity A among the population groups whose incomes are
v_1, v_2, ... If we know the distribution of variable (VPW) we can
find the equations of the three regression surfaces. These
equations are presented below in a general form

$$v = g_1(p, w), \qquad (4)$$

$$p = g_2(v, w), \qquad (5)$$

$$w = g_3(v, p). \qquad (6)$$

These formulae express in a functional way the relation- 
ships between three economic quantities: the average con- 
sumption per person of commodity A, the price of commod- 
ity A, and the average income per person. The determina- 
tion of equations (4), (5) and (6) is very difficult in practice. 
This is due to the complex nature of economic phenomena 
which are interrelated by strong causal relations. In the course 
of observing economic phenomena we try to consider them 
in isolation. This approach always tends to diminish the 
accuracy of the results of observations and economic anal- 
ysis because of the strength of causal relations existing be- 
tween economic phenomena. In order to improve the accu- 
racy, we must consider in the process of learning ever new 
relationships between the phenomena, trying to bring them 
into focus one by one, according to the diminishing strength 
of their influence. 

If in equation (2) we substitute the expected value
w = g_3(v,p) for W(v,p) then the equation

$$d(v,p) = g_3(v,p) \cdot f(v,p) \cdot N \qquad (7)$$

will represent the expected size of the demand for commodity A 
on the assumption that income V = v and price P = p. 

Formula (7) expresses explicitly the dependence of demand 
on income and price. These two economic factors undoubt- 
edly exert the greatest influence on the size of demand. How- 
ever, they are not the only factors. If income and the price 
of commodity A are determined, the demand for it may change
depending on changes in the prices of the complementary
goods of, and substitutes for, commodity A. Since the deter- 
mination of all the relationships of this type is practically 
impossible we either take into account only the most impor- 
tant relationships or content ourselves with an analysis of the 
relationship between demand, income and price. 

Equation (5) shows the dependence of the price of commodity 
A on the average consumption per person of this commod- 
ity and on income. 

In a free market economy the supply of commodity A 
depends on price. 

Let us introduce new variables U, Y, Z, where U denotes
sales revenues, Y costs and Z profit. In this case

$$U = X \cdot P \qquad (8)$$

and

$$U = Y + Z \qquad (9)$$

(symbol X denotes the volume of production).

At given values P = p, Y = y and Z = z the supply

$$s = \frac{y + z}{p}. \qquad (10)$$

In an economy governed by the principle of profitability 
profit is an economic factor providing an inducement to 
increase production. Of course, under these circumstances 
production should be increased until profit reaches its maxi- 
mum value. 

To decide how high the volume of production should be to 
obtain maximum profit we have to know the relationship be- 
tween costs and production. The knowledge of this relationship 
is a condition for a rational management of production. It 
follows from the equation

$$p = \frac{y + z}{s} \qquad (11)$$

that at given values s and z the price of commodity A decreases
when the cost of its production decreases. Since
a drop in the price of commodity A results in an increase 
in the demand for it the lowering of the cost of production 
leads to a better satisfaction of wants. 
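The relationship between supply, price, cost and profit follows from formulas (8) and (9): since X · P = Y + Z, the quantity supplied is (y + z) / p and the price is (y + z) / s. The sketch below uses hypothetical figures; the function names are ours.

```python
def supply(cost, profit, price):
    """Quantity supplied implied by X * P = Y + Z at a given price."""
    return (cost + profit) / price


def price(cost, profit, quantity):
    """Price implied by X * P = Y + Z at a given quantity supplied."""
    return (cost + profit) / quantity


# Hypothetical figures: cost 800, profit 200, price 10.
s = supply(800.0, 200.0, 10.0)
p_before = price(800.0, 200.0, s)
p_after = price(700.0, 200.0, s)   # cost lowered, supply and profit unchanged

print(f"supply: {s}, price before: {p_before}, price after: {p_after}")
```

At unchanged supply and profit the lower cost gives a lower price, which is the conclusion drawn in the text.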



2.2.4. Cost curves

As we know from mechanics, the ratio of the amount of 
energy received from a given machine to the amount of energy 
supplied to it is called the coefficient of efficiency and is
denoted by the symbol η. Thus

$$\eta = \frac{E_1}{E_2},$$

where E_1 stands for the energy produced and E_2 for the energy
used.

Graph 1 shows the curve for coefficient η as a function
of variable E_1. This variable represents the output of energy
produced by the machine studied. 

GRAPH 1. 



It follows from the characteristic shape of curve η that if
the productive capacity of the machine is not fully utilized
the coefficient of efficiency η is low. As production increases
the efficiency of the machine rises and eventually reaches its
maximum point. The abscissa of this point represents the
optimum size of production E_1 = E_10. At this level of
production the efficiency of the machine is the highest. If
production increases beyond the optimum value E_10 the efficiency
of the machine decreases since it is overloaded and consequently
operates under unfavourable conditions.

It follows from the shape of curve η that the curve representing
the dependence of variable E_2 on variable E_1 must
be a monotonically increasing curve.

This curve rises rapidly at first, then a little more slowly 
and then rapidly again (see Graph 2). 

GRAPH 2. 




If instead of a single machine (a boiler, engine or gener- 
ator) we consider a whole enterprise then production X will 
be an equivalent of variable E_1, and cost Y will be an equivalent
of variable E_2. Hence we can postulate the hypothesis
that the shape of the total cost curve is similar to the shape 
of the curve shown on Graph 2. In other words, this curve 
may be considered as a hypothetical total cost curve. We have 
used the term "hypothetical curve" since in reality both vari- 
able X (production) and variable Y (cost) are random vari- 
ables. The joint distribution of variable (X, Y) is different 
for each enterprise. In order to learn about the shape of the 
regression line describing the relationship between cost and 
production we have to carry out appropriate statistical
research in the enterprise which interests us. As we shall see
later, the regression line is usually a straight line.

GRAPH 3. (Horizontal axis: production.)

On Graph 3 the total cost curve is shown. On this graph 
we can also see a portion of a straight line which does not 
differ much from the curve between the points marked by two 
vertical dashes. 

We shall denote the total cost by the symbol Y. On Graph 4 
the hypothetical shape of curve Y is shown. It appears from 
the graph that costs are incurred even when the production 
equals zero. The amount of this cost is represented on the 
graph by the segment determined on the positive part of the 
Y-axis by the point of intersection of curve Y with this axis. 



GRAPH 4.

GRAPH 5. (Horizontal axis: production. According to Paulsen [43].)

These costs are known in the literature under different
names, e.g. fixed costs, or independent costs. The fixed cost is
denoted on Graph 4 by the symbol y_s. If we deduct the
fixed cost from the total cost and if we divide the difference
obtained by the volume of production we get the variable
cost or dependent cost per unit of output. The variable cost is
denoted by the symbol Y_z. Thus

$$Y_z = \frac{y - y_s}{x}.$$

In a geometrical representation the variable cost is the
tangent of angle α (see Graph 4).

By dividing the total cost by the volume of production we
get the average, or unit cost which in Graph 5 is denoted by
the symbol Ȳ. In this case

$$\bar{Y} = \frac{y}{x}.$$

The average cost is equal to the tangent of angle β (Graph 4).
If the total cost y is a continuous function of the production x
and if at every point of a certain interval within which this
function is determined it has the derivative

$$y' = \frac{dy}{dx},$$

then y' is called the marginal cost. Marginal cost is equal
to the tangent of angle γ between the line tangent to the
curve and the horizontal axis (Graph 4).
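The three cost concepts can be computed side by side for any given total cost function. The sketch below uses a hypothetical cubic total cost curve, y = 50 + 10x - x^2 + 0.05x^3, chosen by us only because it rises throughout and has a single point of inflexion; it is not an empirical cost function.

```python
FIXED_COST = 50.0  # hypothetical cost incurred at zero production


def total_cost(x):
    """Hypothetical total cost curve with one point of inflexion."""
    return FIXED_COST + 10.0 * x - x ** 2 + 0.05 * x ** 3


def variable_cost(x):
    """(Total cost minus fixed cost) divided by the volume of production."""
    return (total_cost(x) - FIXED_COST) / x


def average_cost(x):
    """Total cost divided by the volume of production (unit cost)."""
    return total_cost(x) / x


def marginal_cost(x):
    """Derivative of the total cost with respect to production."""
    return 10.0 - 2.0 * x + 0.15 * x ** 2


x = 10.0
print(f"total cost:    {total_cost(x):.2f}")
print(f"variable cost: {variable_cost(x):.2f}")
print(f"average cost:  {average_cost(x):.2f}")
print(f"marginal cost: {marginal_cost(x):.2f}")
```

At x = 10 the sketch gives a total cost of 100, a variable cost of 5, an average cost of 10 and a marginal cost of 5.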

Variable cost, average cost and marginal cost are all func- 
tions of production. On Graph 5 we can see three curves 
representing these functions. The cost curves shown on 
Graphs 4 and 5 are not only graphic presentations of the 
interdependence between total cost on the one hand, and 
marginal, average and variable costs on the other; they are 
also a valuable tool of research. On the basis of these curves 
we can make several important observations. Assuming that
the hypothesis concerning the shape of the total cost curve
is true, i.e. that: 

1) the curve is located in the 1st quadrant of the coordinate 
system; 

2) the ordinate of the curve at the point where the abscissa 
equals zero is not negative; 

3) the curve is continuous within the whole interval of its
validity and has a derivative at each point of this in- 
terval; 

4) the curve has one point of inflexion separating the con- 
vex from the concave part; 

the correctness of these observations follows directly from the 
graph. We can prove their validity in a formal way. 

Observation 1. The minimum marginal cost is lower than
the minimum variable cost which, in turn, is lower than the
minimum average cost:

min Y' < min Y_z < min Ȳ.

Observation 2. The minimum variable cost is determined 
by the point of intersection of the marginal cost curve with 
the variable cost curve (point B on Graph 5). 

Observation 3. The minimum average cost is determined 
by the point of intersection of the marginal cost curve with 
the average cost curve (point C on Graph 5). 
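The three observations can be checked numerically on an example. The sketch below again uses a hypothetical cubic total cost curve of our own choosing, y = 50 + 10x - x^2 + 0.05x^3, and searches a grid of production levels for the minima of the three cost curves. It is an illustration on one curve, not a proof.

```python
FIXED_COST = 50.0  # hypothetical fixed cost


def total_cost(x):
    """Hypothetical total cost curve with one point of inflexion."""
    return FIXED_COST + 10.0 * x - x ** 2 + 0.05 * x ** 3


def variable_cost(x):
    return (total_cost(x) - FIXED_COST) / x


def average_cost(x):
    return total_cost(x) / x


def marginal_cost(x):
    return 10.0 - 2.0 * x + 0.15 * x ** 2


# Production levels 0.01, 0.02, ..., 30.00.
GRID = [i / 100 for i in range(1, 3001)]

x_min_marginal = min(GRID, key=marginal_cost)
x_min_variable = min(GRID, key=variable_cost)
x_min_average = min(GRID, key=average_cost)

print(f"min marginal cost {marginal_cost(x_min_marginal):.3f} at x = {x_min_marginal}")
print(f"min variable cost {variable_cost(x_min_variable):.3f} at x = {x_min_variable}")
print(f"min average cost  {average_cost(x_min_average):.3f} at x = {x_min_average}")
print(f"marginal cost at these two points: "
      f"{marginal_cost(x_min_variable):.3f}, {marginal_cost(x_min_average):.3f}")
```

For this curve the three minima are ordered as in Observation 1, and at the minimum of the variable cost and of the average cost the marginal cost equals, up to the accuracy of the grid, the variable cost and the average cost respectively, as stated in Observations 2 and 3.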

Cost curves allow us to find correct solutions to the prob- 
lem of the "optimum production size". 

Assuming that an economy is based on the principle of 
profitability the optimum size of production means the size 
at which total profit is a maximum. Some economists con- 
sider that the optimum size of production can be determined 
with the help of the unit cost curve. They maintain that the 
optimum level of production is one at which the average cost 
is a minimum and thus presumably profit is a maximum. In
spite of its apparent correctness this statement is not true.
We can easily prove the following: 

THEOREM 1. Total profit reaches its maximum value when 
marginal revenue equals marginal cost, or, which amounts to 
the same thing, when marginal cost equals price. 

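Proof. Let U denote revenue, X production, P price, 
Y cost and Z profit. In this case 

$$Z = U - Y,$$

so that 

$$\frac{dZ}{dX} = \frac{dU}{dX} - \frac{dY}{dX}.$$

If profit Z is to be a maximum it is necessary that 

$$\frac{dZ}{dX} = 0,$$

i.e. 

$$U' = Y'.$$

But, the price P being given, 

$$U = PX, \qquad U' = P.$$

Hence 

$$Y' = P.$$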
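The statement of Theorem 1 can be checked numerically. The sketch below is only an illustration: the cubic total cost function and the price of 30 are hypothetical values chosen for convenience, not data from the text. It compares the production size that minimizes average cost with the one that maximizes profit.

```python
# Illustrative check of Theorem 1 with hypothetical numbers (not from the text).
# Assumed total cost: Y(X) = X^3 - 6X^2 + 15X + 32; assumed given price P = 30.

P = 30.0


def total_cost(x):
    return x**3 - 6 * x**2 + 15 * x + 32


def marginal_cost(x):
    # Derivative of total_cost with respect to x.
    return 3 * x**2 - 12 * x + 15


def average_cost(x):
    return total_cost(x) / x


def profit(x):
    # Z = U - Y with revenue U = P * X.
    return P * x - total_cost(x)


# Search a grid of production sizes 0.01, 0.02, ..., 10.00.
grid = [i / 100 for i in range(1, 1001)]
x_min_avg_cost = min(grid, key=average_cost)
x_max_profit = max(grid, key=profit)

print("minimum average cost at X =", x_min_avg_cost)
print("maximum profit at X =", x_max_profit)
print("marginal cost at the profit maximum:", marginal_cost(x_max_profit))
print("profit at the two points:", profit(x_min_avg_cost), profit(x_max_profit))
```

With these assumed figures the two production sizes differ, and at the profit-maximizing size the marginal cost coincides with the price, which is what the theorem asserts.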
Graph 6 is a geometric representation of Theorem 1. It 
follows from the graph that profit attains its maximum value 
at X = x_1, and not at X = x_0. 

This means that if we determine the volume of production 
in such a way as to minimize unit cost, the profit is less than 
it would be if the production were X = x_1. The correct 
optimum production size is obtained by using marginal cost and 
not average cost. Both Theorem 1 and its proof have been 
known in economics since the days of Cournot. This theorem 
is valid in a socialist economy if the principle of profitability 
is observed. So far this theorem has not been applied con- 
sciously (it is, perhaps, applied to planning production in such 
a way as to maximize profit at a given price, but this is done 
somewhat unintentionally, more on the basis of experience 
and intuition, than of theory). It seems that the chief reason 
for this attitude is the reluctance of economists to use mathe- 
matical and statistical methods of research in their profes- 
sional work. 

Cost analysis is undoubtedly one of the most important 
and difficult problems facing the economist. The lower the 
social cost of production the higher is the social productivity 
of labour and the better the satisfaction of the needs of the 
society. In economic literature the subject of costs occupies 
the most prominent position. The determination of the equa- 
tion of a regression line which is an approximation to the 
total cost curve is a typical econometric problem. The prac- 
tical aspect of this problem is sufficiently important to justify 
dealing with it in greater detail. 

The human desire to satisfy wants induces people to pro- 
duce such goods as will satisfy them. The production of these 
goods requires sacrifices on the part of the society; it requires 
labour power, materials, power and all those factors of 
production without which the production process would be 
impossible. The society is willing to make these sacrifices only 
because without them it is impossible to produce and thus 
to satisfy human needs. It follows that production is the only 
economic and social justification of the cost of production. 

It follows from economic considerations that there should 
be an interdependence between costs and production. In math- 
ematical economics it is assumed that cost is a function of 
production. The total cost curve is a geometrical represen- 
tation of this function. The representation of the relation- 
ship between cost and production as a mathematical function 
is, of course, a scientific abstraction. In real life neither cost 
nor production is a variable in the general sense, but each 
is a random variable. We can learn about the interdepend- 
ence between cost and production only by statistical re- 
search. The procedure leading to the knowledge of this inter- 
dependence has to follow a certain sequence. Each manu- 
facturing enterprise keeps a record of cost and production. 
This record provides periodically, usually monthly, 
statistical data pertaining to the size of production and cost. If 
we denote production by X and cost by Y we can treat these 
quantities as the realization (x_i, y_i) (i = 1, 2, ..., n) of the 
two-dimensional random variable (X, Y). A point on a plane 
corresponds to each pair of numbers (x_i, y_i). A collection of such 
points may be regarded as a sample selected from an infinite 
general population. If it is true that there is an interdepend- 
ence between cost and production and if production and 
cost records are properly kept, the distribution of points 
on the scatter diagram will show a trend. The regression line, 
being a statistical representation of this trend, is a functional 
way of expressing the interdependence between cost and 
production. 

It is especially worth noting that, as numerous studies have 
shown, the regression line describing the relationship be- 
tween cost and production is usually a straight line. Let us 
quote a few opinions on this subject. Falewicz (page 61 of 
his book quoted here) says: "In spite of the fact that 
theoretically the line best representing the relationship between 
cost and production (if it could be established for all 
possible sizes of production, from 0 to the highest that the capacity 
of the enterprise permits) would be a curve of an equation 
of a high, probably not less than 3rd, degree, in practice, 
when it is possible to study this relationship only within 
certain limits of production size, we can assume, with a 
sufficiently high degree of accuracy, that it can be represented by 
a straight line". 

And here is what Tinbergen has to say: "It has been estab- 
lished that in many industries the shape of the curve of total 
cost with respect to the volume of production can generally 
be represented by a straight line". Similar opinions are ex- 
pressed by Dean [12], Lyle [37] and many other statisticians 
who have studied the interdependence between cost and 
production. Very characteristic and to the point is a comment 
by Tintner. On page 49 of his book [57] we read: "It is remark- 
able that (in the relevant interval covered by the data) the 
total cost of making steel, seems to be a linear function of the 
amount of product. Hence the marginal cost is constant. 
The importance of this fact of constant short-run marginal 
cost discovered by all investigators of statistical cost 
functions contradicts the a priori assumptions of the 
economists." 

We have quoted the above opinions in order to show that 
in practice the regression line describing the relationship 
between cost and production is a straight line. This is of great 
importance since the determination of the linear regression 
parameters is relatively easy and, therefore, a statistical anal- 
ysis of the relationship between cost and production could 
and should be made widely known. 
2.2.5. Time curves 



Time series constitute a wide field for applications of re- 
gression theory. It is well known that a trend is one of the 
characteristics of a time series. The notion of trend is inter- 
preted in the literature in a variety of ways. Below we describe 
two generally accepted interpretations. 

Interpretation I. The following time series is given 

$$x_1, x_2, \ldots, x_t, \ldots$$

where t assumes integer values. An illustration of a time 
series is provided by the corn crop yields in the USSR in 
1922-1934. 

TABLE 1 
CROP YIELDS IN THE USSR 

Year    Yields in metric quintals per hectare 
1922    7.6 
1923    7.2 
1924    6.2 
1925    8.3 
1926    8.2 
1927    7.6 
1928    7.9 
1929    7.5 
1930    8.5 
1931    8.7 
1932    7.0 
1933    8.8 
1934    8.5 

(see [41], p. 171). 

The statistical data of Table 1 are shown on Graph 1. The 
broken line on this graph is called a time curve. It shows 
various irregular breaks which are a result of random factors. 
These breaks in the time curve not only do not help in the 
process of learning, but on the contrary, make it difficult 
to detect the influence of the regular factor which causes 
crop yields per hectare to show a tendency to increase. 

GRAPH 1. 
According to the first interpretation a trend is a line 
expressing a general tendency in the shape of a time curve. A trend 
line is determined by the elimination of random oscillations 
from the time curve. The parameters of the trend line are 
obtained by appropriate statistical methods, e.g. the moving 
average method, or the method of least squares. This inter- 
pretation of trend is accepted by O. Lange in his textbook 
on statistics [35] in which he writes: "Analysing time series 
we notice that they show a certain development tendency" 
(p. 181). And further on: "Table 46 gives yields in metric 
quintals per hectare in the USSR in 1922-1934; these data 
are shown on Graph 46 1 . A development tendency can be 
clearly seen. The yield per hectare fluctuates from year to 
year but on the whole there is, undoubtedly, an increase in 
yield... A development tendency may be emphasized by a 
procedure called the smoothing out of a time series" (p. 182). 
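
The two smoothing procedures named above can be illustrated on the yields of Table 1. The following is a minimal sketch, not the procedure of any of the authors quoted: the three-year window of the moving average and the choice of a straight line for the least-squares trend are assumptions made only for the illustration.

```python
# Crop yields as printed in Table 1 (metric quintals per hectare, 1922-1934).
years = list(range(1922, 1935))
yields = [7.6, 7.2, 6.2, 8.3, 8.2, 7.6, 7.9, 7.5, 8.5, 8.7, 7.0, 8.8, 8.5]


def moving_average(values, window=3):
    # Simple moving average; the window of 3 years is an assumption.
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def linear_trend(ts, xs):
    # Least-squares straight line xs ~ slope * ts + intercept.
    n = len(ts)
    t_mean = sum(ts) / n
    x_mean = sum(xs) / n
    slope = (sum((t - t_mean) * (x - x_mean) for t, x in zip(ts, xs))
             / sum((t - t_mean) ** 2 for t in ts))
    intercept = x_mean - slope * t_mean
    return slope, intercept


smoothed = moving_average(yields)
slope, intercept = linear_trend(years, yields)

# Each smoothed value is attached to the middle year of its window.
for year, value in zip(years[1:-1], smoothed):
    print(year, round(value, 2))
print("trend slope (quintals per hectare per year):", round(slope, 4))
```

The positive slope agrees with the increase in yield noted in the quotation; both outputs only describe the series over the years observed.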

In this interpretation a trend line is a regression II line; 
by a visual inspection of the time series we select a family 



1 See Table 1, Graph 1. 
of approximation functions and determine the parameters 
of one of the functions belonging to this family. The curve 
of this function is the trend line. The role of the trend line in 
this interpretation is to smooth out the time curve. The value 
of the trend line as a tool of learning is much smaller than 
the value of the regression I line. In an analysis of the rela- 
tionship between two random variables X and Y when both 
are independent from time, the regression I line assigns con- 
ditional expected values of one variable to any values of 
the other variable. When we deal with a time series there are 
no conditional expected values involved. 

Trend is a smoothed out time curve. In the first interpre- 
tation, trend can be used only to describe a time series, but 
it cannot constitute the basis for a prediction concerning 
the development of a stochastic process in the future. Even 
the most careful extrapolation is not permissible. 

Interpretation II. The values of a time series are a realization 
of a certain stochastic process. This process may be subject 
to some law which can be described by an appropriate math- 
ematical function. The nature of this law, and consequently 
the shape of the function, are known. The parameters of the 
function, however, are not known. They can be estimated by 
statistical methods on the basis of the statistical material 
contained in the time series. In this interpretation the trend 
line is a geometric presentation of the function which is known, 
a priori, to be a mathematical expression of a law governing 
the stochastic process under consideration. 

This interpretation often leads one astray and results in 
mathematical formalism. The law governing a stochastic proc- 
ess is rarely known a priori 1 . Hence a temptation to proceed 
in the following way: on the basis of the visual inspection of 



1 In economic applications this law is sometimes known in the sta- 
tistical quality control of production. 
the shape of a time curve we select an appropriate 
approximation function and then argue "theoretically" that this 
curve expresses, in fact, a "law" governing the realization 
of a stochastic process. Such a temptation is particularly 
strong when a time curve shows only minor fluctuations, 
like the curve shown on Graph 2. It would seem that this kind 
of procedure is too obviously against common sense to be 
used. This is not so, however. O. Lange, in his book mentioned 
above, quotes two examples of an improper application of 
a logistic curve to smoothing out a time series. In both cases 
a logistic curve was used not only because it "fitted" well 
to the statistical material but primarily because it allegedly 
expressed the "law of growth" which can be presented in a 
mathematical form as a differential equation: 

$$\frac{dx}{dt} = x(a-x)\,g(t), \qquad (0 < x < a),$$

where a is a constant called the "level of saturation". 

The application of this equation to the analysis of the trend 
line is deprived of all economic justification. Both cases cited 
by Lange are examples of mathematical formalism; he empha- 
sizes this fact very strongly. 

The equation of a logistic curve is the integral of the differ- 
ential equation mentioned above and presumably express- 
ing the "law of growth". 
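
For reference, the integration can be written out as follows. This is the standard separation-of-variables argument and is not taken from the text; G(t), an antiderivative of g(t), and the constants c and b are symbols introduced here only for the illustration.

```latex
\begin{align*}
\frac{dx}{dt} &= x(a-x)\,g(t), \qquad 0 < x < a,\\
\frac{dx}{x(a-x)} &= g(t)\,dt, \qquad
  \frac{1}{x(a-x)} = \frac{1}{a}\left(\frac{1}{x} + \frac{1}{a-x}\right),\\
\frac{1}{a}\,\ln\frac{x}{a-x} &= G(t) + c, \qquad G'(t) = g(t),\\
x(t) &= \frac{a}{1 + b\,e^{-aG(t)}}, \qquad b = e^{-ac} > 0 .
\end{align*}
```

When g(t) is a positive constant k, this reduces to x(t) = a / (1 + b e^{-akt}), a curve that rises towards the level of saturation a without reaching it.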

One of the examples given in Lange's book comes from 
"Theory of Econometrics" by H. T. Davis, who also wrote 
"The Analysis of Economic Time Series". The latter book 
prompted M. G. Kendall to express the following short but 
pointed opinion: "Davis's book on 'The Analysis of Economic 
Time Series' (1941) contains a great deal of interesting 
material but should not be read uncritically" (see [33], p. 437). 
(see [64]). 

It might be worth while to mention here a very pertinent 
comment on the subject of formalism by A. Hald: 

"The logistic curve has been frequently used to illustrate 
the growth of 'populations' (cells, human populations, 
telephone subscribers, etc.), the development of business trans- 
actions between different countries, education of persons in 
various manual and mental accomplishments, etc. Regarding 
most of these applications it may be said that the theoretical 
analysis of the growth process in hand is so uncertain that 
it is doubtful whether or not the process is governed by a 
differential equation such as (20.7.2) 1 , wherefore application of 
the logistic curve is mainly based on its descriptive proper- 
ties. The results of the extrapolations regarding population 
figures, production, etc., which have been carried out on this 
basis should therefore be regarded with great scepticism" 
[28] 2 . 

J. M. Keynes expresses his opinion on the matter very 
frankly: 

"Too large a proportion of recent 'mathematical' eco- 
nomics are mere concoctions, as imprecise as the initial as- 
sumptions they rest on, which allow the author to lose sight 
of the complexities and interdependencies of the real world 
in a maze of pretentious and unhelpful symbols" ([34], p. 298). 
It would appear from the above quotations that if there are 
economists who use mathematics improperly, there also are 
those who see and criticize their mistakes. 

The more correct of the two interpretations of trend de- 
scribed above is the first. It follows from the definition given 
on p. 88, that from a formal point of view the trend line 
can be considered as a regression II line. All those who prefer 
the first interpretation of the trend line agree on this. How- 
ever, there is one doubt. The regression II line is a function 
g(x) for which 

$$E[Y - g(x)]^2 = \text{minimum} \qquad (1)$$

(see p. 36). With respect to the time series x_1, x_2, ..., x_t, ... 
formula (1) will assume the following form: 

$$E[X - g(t)]^2 = \text{minimum}, \qquad (2)$$

1 i.e. by equation: 

$$\frac{dx}{dt} = x(a-x)\,g(t), \qquad (0 < x < a).$$

2 A. Hald: Statistical Theory with Engineering Applications, New 
York, 1952, p. 661. 
where g(t) is the symbol of the trend equation. Formula (2) 
states that the trend is a function of time g(t) for which the 
mathematical expectation of squared deviations of the values 
of the time series from the values of this function at moments 
t = 1, 2, ... is a minimum. The doubt mentioned above is due to 
the fact that it is not really known exactly how to interpret 
the notion of mathematical expectation with respect to the 
random variable dependent upon time. With reference to a 
random variable which is independent of time, mathematical 
expectation is a distribution parameter of this variable. This 
parameter is a number. The situation is different when the 
random variable depends on time. In this case its distribution 
depends on time and consequently also the parameters of this 
distribution depend on time, are functions of time. Naturally 
this case is not covered by the definition of the mathematical 
expectation of a random variable independent of time. This 
means that if the trend is understood as a regression II line 
of a random variable correlated to time then there are certain 
aspects that require explanation. It should be stressed that the 
problem of the definition of a trend is, so far, an open prob- 
lem in the literature on statistics. For instance, Hald [28] 
mentions a textbook by Kendall [33] in the bibliography of 
studies related to time series. Indeed, this textbook can be 
considered as the most important one as far as time series 
are concerned because of the amount of space and attention 
given to this subject. However, even Kendall does not give 
a definition of a trend that is free from the reservations men- 
tioned above (see [33], p. 371). This fact has prompted the 
author to attempt to formulate such a definition. It is given 
in 6.1. 

2.2.6. Techno-economic curves 

Correlation analysis finds many applications in different 
branches of technology where it is often necessary to discover 
relationships between various random variables. Among 
such relationships are: the relationship between the amount 
of gas or liquid sucked in by a suction pump and the degree 
of vacuum created in the pump, expressed in percentages; 
the relationship between the durability of alloys used for 
heat resistors and the temperature in which they operate; 
the relationship between the hardness of steel used for tool 
manufacturing and temperature or carbon content, etc. 

There is no need to give further examples of the applica- 
tions of correlation analysis to technological research. They 
are numerous and increase almost every day. It should 
be emphasized that studies on relationships in the field of 
technique very often have an important economic aspect in 
addition to a technical aspect. For instance, if some changes 
have been introduced in a technological process (i.e. in the 
process of manufacturing scarce goods) in consequence of 
technological research, these changes may, and usually do, 
have economic as well as technical effects. From an economic 
point of view these effects may be positive, neutral, or nega- 
tive. The criterion for this type of classification is found in 
production costs. 

Let us discuss the already-mentioned relationship between 
the durability of heat resistant alloys and the temperature in 
which they are used. This type of research is conducted in 
connection with a search for the most durable alloys. In 
electrical engineering various alloys are used: constantan, 
manganin, nickeline, nichrome, chromel, alumel, kanthal and 
others. Each of these alloys has a different durability, 
different resistance to high temperatures, changes in the frequency 
of heating, cooling, etc. Depending on the technical require- 
ments, one or another type of alloy is used. In making a 
choice economic consequences have to be taken into consid- 
eration. Some alloys can be produced at home and others 
have to be imported; for some of them the raw materials 
are available at home; for some they are not. Costs of 
production are different for each alloy. The more durable alloys 
are usually more expensive. It follows that studies of the 
relationship between the durability of the alloy and tempera- 
ture are of interest not only to the technician but also to the 
economist. 

All examples of statistical relationships which have both 
technological and economic aspects we shall call techno-eco- 
nomic relationships. The curves which are a graphic represen- 
tation of these relationships we shall call techno-economic 
curves. An interesting example of techno-economic curves is 
presented on Graph 1, taken from a study by Vernon 
L. Smith (see [53]) 1 . 

GRAPH 1. 
The regression lines shown on this graph describe the 
relationship between the consumption of fuel and the weight of 
a car together with its load. Number R is a measure of the 



1 See also 3.2.2., Example 2. 
slope of the road (the slope coefficient). It is a fraction 
expressing the increase in height in feet per 100 feet of the 
length of the road. It can be seen from the graph how the 
consumption of fuel has decreased in consequence of technical 
improvements. The greater the value of R the smaller is the 
drop in the consumption of fuel. The relationship between 
fuel consumption and the weight of the car with a load is a 
technical problem, but its economic consequences are so 
clear and far reaching, that no comments are required. The 
regression curves shown on the graph can be used as a basis 
for setting fuel consumption norms. Much has been written 
in literature about the uselessness of "statistical" norms 
and the necessity of replacing them by "technical" norms. 
It can be seen from the above example that without the help 
of statistics (in this case correlation analysis) it would be 
difficult to set a technical norm. 



3. ESTIMATING LINEAR REGRESSION PARAMETERS 

3.1. General remarks about methods of estimating 

There are several methods of estimating the parameters 
of a general population on the basis of statistical data sup- 
plied by a random sample of the population. The most im- 
portant are: the maximum likelihood method, the minimum 
variance method, the minimum χ² method, and the method of 
least squares. So far, only the method of least squares has 
been used in regression theory because of its many advan- 
tages. This method is easier to comprehend than the others 
since it requires only knowledge of how to find the maximum 
or minimum of a function by differential calculus, and it is not 
necessary to know mathematical statistics. The method of 
least squares is very general. It may provide solutions in cases 
when other methods have failed. For both these reasons the 
method of least squares is known by astronomers and sur- 
veyors, physicists and biologists, technicians and economists. 
Since a basic knowledge of calculus is necessary to learn the 
method of least squares, it is used almost exclusively by scien- 
tists. Practical workers rarely use it. 

The method of least squares has a very valuable formal 
quality, important in cases of linear regression. There is a theo- 
rem known as the Markoff Theorem [8] which states that 
estimates obtained by this method are consistent, unbiased 
and most efficient. In this theorem it is not assumed that 
the distribution of random variables is normal, or even that 
these variables are independent. 
In spite of these unquestionable advantages of the method 
of least squares, there are two reasons why we here propose 
a new method of estimating regression parameters besides 
the method of least squares. Until it is finally given a name 
we shall call it the two-point method. However, it should be 
understood that this is only a temporary term. In cases of 
linear regression the two-point method also provides con- 
sistent and unbiased estimates of regression parameters, but 

1) computations of the values of estimates are much easier 
than those required in the method of least squares, and 

2) to learn the two-point method it is not necessary to 
know the calculations for maxima and minima required 
in the method of least squares. 

The efficiency of the estimates obtained by the two-point 
method is a little lower than the efficiency of the estimates 
obtained by the method of least squares, but when a sample 
is large this consideration does not carry great weight. The 
important advantages that are gained by the introduction 
of the two-point method in the theory of estimating linear 
regression parameters consist, first of all, in the fact that this 
method is conducive to the popularization of regression and 
correlation theory among practical workers. This is of partic- 
ular importance in economic research. In Chapter 2 we have 
discussed the most important applications of regression and 
correlation to economic research. These applications are di- 
verse and important to the economy. However, they will 
be of real service in expediting the control of economic pro- 
cesses only when correlation analysis becomes a handy tool 
of economic analysis, known and willingly used by those en- 
gaged in economic activities. The main obstacle to popular- 
izing regression and correlation methods among economists 
is the undoubtedly too high requirement of mathematical 
knowledge for the determination of regression parameters by 

the classical method 1 . The two-point method is, to a large 
extent, free of these difficulties. Let us hope that many people, 
after they learn the two-point method and see the advantages 
in the application of statistical methods to studies of the 
relationships between random variables, will make an effort 
to improve their knowledge and gradually to master the 
classical method. 

3.2. Estimating linear regression parameters by the method 
of least squares 

3.2.1. The derivation of formulae. Examples 

All our further considerations concerning the two-dimensional 
variable (X, Y) will be based on the following assumptions 
[28] 2 : 

1) the conditional distributions of variable Y, corresponding 
to any values of variable X, are normal distributions; 

2) the regression line of Y on X in a general population 
is a straight line with the equation 

$$y = \alpha x + \beta,$$

where α and β are constant parameters; 

3) the conditional variance V(Y | X = x) is a constant; 

4) points (x_i, y_i) (i = 1, 2, ..., n) drawn for the sample, where 
n is the size of the sample, are stochastically independent. 

Let us denote by Q a two-dimensional general population. 
The random variable (X, Y) is defined by the elements of this 
population. From population Q we draw a random sample 
ω comprising n items. 



1 i.e. the method of least squares. 

2 A. Hald: Statistical Theory with Engineering Applications, New 
York, 1952, p. 528. 
Problem. On the basis of the data from sample ω, estimate 
the parameters α and β of the regression line y = αx + β in 
the general population Q. 

This type of problem is usually solved by the method of 
least squares. Let us denote by a and b the estimates, obtained 
from the sample, of the unknown parameters α and β 
in Q. In order to determine the values of these estimates 
we have to minimize the expression 

$$\sum_{i=1}^{n} (y_i - a x_i - b)^2.$$

After simple transformations (see 1.2.7.) we get a set of 
normal equations: 

$$\sum_{i=1}^{n} x_i y_i - a \sum_{i=1}^{n} x_i^2 - b \sum_{i=1}^{n} x_i = 0,$$

$$\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i - nb = 0.$$

Solving them with respect to a and b we obtain 

$$b = \bar{y} - a\bar{x}, \qquad (1)$$

$$a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}. \qquad (2)$$

In formula (1) $\bar{x}$ and $\bar{y}$ are arithmetic means of the sample, i.e. 

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i. \qquad (3)$$

The equation of the regression line of the sample assumes the 
form: 

$$y = ax + b. \qquad (4)$$
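
Formulae (1)-(3) translate directly into a few lines of code. The sketch below is illustrative only: the function name and the five sample points are hypothetical and are not taken from the text. It also verifies that the computed estimates satisfy the two normal equations.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the sample regression line y = a*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n                          # formula (3)
    y_mean = sum(ys) / n
    a = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))    # formula (2)
    b = y_mean - a * x_mean                       # formula (1)
    return a, b


# Hypothetical sample, chosen only to exercise the formulae.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = least_squares_line(xs, ys)

# Residuals of the fitted line. The normal equations are equivalent to
# sum(e_i) = 0 and sum(x_i * e_i) = 0.
residuals = [y - a * x - b for x, y in zip(xs, ys)]
sum_e = sum(residuals)
sum_xe = sum(x * e for x, e in zip(xs, residuals))

print("a =", round(a, 4), "b =", round(b, 4))
print("sum of residuals:", round(sum_e, 10))
print("sum of x * residual:", round(sum_xe, 10))
```

Both checks come out as zero up to rounding error, which is another way of saying that the pair (a, b) solves the normal equations.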
Let us denote further 



= I/ 
I/ 



(5) 



These are standard deviations of variables X and Y of the sample. 

<toO = - (*,-*)(>>,-#. (6) 

w.-i 

This is the covariance of the sample. 

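In this case the regression parameters of Y on X can be 
found in the following way: 

$$a_{20} = \bar{y} - a_{21}\bar{x}, \qquad (7)$$

$$a_{21} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}. \qquad (8)$$

It is not difficult to notice the similarity of formulae (7) and 
(8) to formulae (6) and (8) from 1.2.7. 
Similarly for the regression of X on Y we have 

$$a_{10} = \bar{x} - a_{12}\bar{y}, \qquad (9)$$

$$a_{12} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (y_i - \bar{y})^2}. \qquad (10)$$

Parameters a_21 and a_12 we shall call regression coefficients of 
the sample. 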

In case of the regression of Y on X the standard error of 
the estimate in the sample by analogy to formula (16) in 
1.2.7. is defined as follows: 



(11) 
Similarly, for the regression of X on Y: 



The sample correlation coefficient r is an estimate of the 
correlation coefficient of the general population Q. We shall 
define coefficient r by the formula 

$$r^2 = a_{21} a_{12}, \qquad (13)$$

by analogy to formula (6), 1.2.8. 

We shall illustrate by an example the method of determining 
numerical values of the estimates of regression line parameters 
obtained by the method of least squares. 

Example 1. Analyse the relationship between the consump- 
tion of compressed air Y and the amount of coal X extracted 
from mine Z. 

The relationship between the amount used and the volume 
of production expressed in the form of increased input con- 
sumption in consequence of increased production is a result 
of a causal relation existing between input consumption and 
production; the only economic reason for increased input 
consumption is increased production. The amount of con- 
sumption is influenced not only by the volume of production, but 
also by secondary causes such as differences in attitude toward 
work among employees during working hours, changes in 
the condition of equipment, differences in the quality of raw 
materials, damages to machines and many others. As a result 
of these causes, the relationship between the amount of input 
consumption and the volume of production appears to be of 
a stochastic nature. 

The consumption of compressed air is related to the use 
of machines and technical equipment needed primarily for the 
extraction of coal. The most typical are: hammer drills, drills 
and punching machines. The characteristic feature of such 
machines is that at the moment they stop operating the con- 
sumption of air used as operating power also stops. In this 
respect they are different from other machines driven 
by other sources of energy like steam or oil; these use 
up substantial amounts of energy during their unproductive 
work. 

The monthly data on the volume of coal production and 
the amount of compressed air used up cover a period of three 
years. Both production and consumption are expressed in 
physical and not monetary units. This eliminates the distur- 
bances which might appear in the relationship as a result of 
price changes. The data on which the analysis of the relation- 
ship between the consumption of compressed air and the 
production of coal is based are shown in Table 1. 



TABLE 1 

CONSUMPTION OF COMPRESSED AIR AND THE PRODUCTION 
OF COAL IN A MINE 

             1949            1950            1951 
Month      x      y        x      y        x      y 
  1       114   18.5      102   18.1       96   16.2 
  2        98   17.1       93   15.4       94   15.2 
  3       118   18.4      100   17.9       97   15.4 
  4       104   17.6      100   17.4       83   14.3 
  5       105   17.7      104   18.1       84   14.5 
  6       104   18.5      104   18.5      103   17.0 
  7       102   18.7      104   19.5       97   16.2 
  8       108   18.8      108   19.2      101   17.1 
  9       111   19.2      101   17.3      104   18.5 
 10       107   18.4      103   19.5      102   17.2 
 11       105   17.4      110   19.1      107   18.3 
 12        91   15.3      106   18.1       94   17.5 

x: the extraction of coal in thousands of tons per month, 
y: the consumption of air in millions of cubic metres per month. 



The corresponding scatter diagram is shown on Graph 1. 
A graph should always be made before the equations of the 
regression lines are computed, because it supplies initial 
information about the nature of the relationship between the 
variables studied. This information enables us: 1) to form 
an opinion whether the relationship between the variables 
is strong or weak, 2) to choose a mathematical func- 
tion to serve as an approximation to the relationship 
between the variables. In Graph 1 the relationship between 
the consumption of compressed air and the extraction of 
coal is presented. Both regression lines are shown; their equa- 
tions are computed below. The distribution of points in the 
graph indicates a linear trend. We can see from the graph 
that the correlation in this case is positive since an increase 
in one variable is accompanied by an increase in the other. 
The correlation between the variables studied can be con- 
sidered fairly strong. This statement is based on observations 
indicating that the direction and magnitude of changes in 
consumption and production generally correspond to one 
another. In other words, when production increases, con- 
sumption also increases and the greater the increase in pro- 
duction, the greater the increase in consumption and vice 
versa, in most cases a drop in production causes a drop in 
consumption and the greater the former, the greater the latter. 
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We have completed reading the graph. Let us now deter- 
mine the parameters of the regression line. To calculate a , 
aw ^20 an d A 10 we have to find Z(xx) (y y); Z(xx) 2 ; 
Z(y-y) 2 l (see (7), (8), (9), (10)). Computations involved in 
determining these quantities are usually placed in a table 
(see Table 2). 

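In the last row of the table, marked Σ, we read:

$$\sum (x-\bar{x})^2 = 1{,}774,$$
$$\sum (y-\bar{y})^2 = 65.14,$$
$$\sum (x-\bar{x})(y-\bar{y}) = 280.24 - 1.9 = 278.34,$$
$$\sum x = 3{,}672, \quad \text{hence } \bar{x} = 102,$$
$$\sum y = 630.0, \quad \text{hence } \bar{y} = 17.5.$$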
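The totals read from the last row of the table can be cross-checked by a short computation. The sketch below is only a check under stated assumptions: it takes the x and y columns of Table 2 as the data, and the function name `centred_sums` is ours, not the book's. One caution when comparing results: the product in row 3 of Table 2 is printed as 1.44, whereas 16 × 0.9 = 14.40, so the recomputed sum of products need not coincide with the printed 278.34.

```python
# Monthly pairs as listed in Table 2, rows 1-36 (assumption: the x and y
# columns of that table are taken as the data).
# x: coal extracted, thousand tons/month; y: compressed air, million m^3/month.
X = [114, 98, 118, 104, 103, 104, 104, 108, 113, 107, 105, 93,
     102, 93, 100, 102, 104, 104, 104, 108, 101, 103, 110, 106,
     96, 94, 97, 85, 84, 103, 97, 101, 104, 102, 107, 94]
Y = [18.5, 17.1, 18.4, 17.6, 17.2, 18.0, 18.5, 18.8, 19.4, 18.4, 17.4, 15.5,
     18.1, 15.4, 17.9, 17.6, 18.1, 18.5, 19.5, 19.2, 17.3, 18.5, 19.1, 18.1,
     16.2, 15.2, 15.4, 14.3, 14.5, 17.0, 16.2, 17.0, 18.5, 17.8, 18.3, 17.5]


def centred_sums(xs, ys):
    """Return (x_bar, y_bar, Sxx, Syy, Sxy) for paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return x_bar, y_bar, sxx, syy, sxy


x_bar, y_bar, sxx, syy, sxy = centred_sums(X, Y)
print(f"x_bar = {x_bar:.2f}, y_bar = {y_bar:.2f}")
print(f"Sxx = {sxx:.2f}, Syy = {syy:.2f}, Sxy = {sxy:.2f}")
```

The means and the two sums of squares can be compared directly with the Σ row of the table; the sum of products is the figure to inspect most carefully.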

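Having the above data we calculate the parameters of both
regression lines:

$$a_{21} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2} = \frac{278.34}{1{,}774} \approx 0.156.$$

In order to determine the dimension of parameter $a_{21}$, we
insert into the formula

$$a_{21} = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sum (x-\bar{x})^2}$$

the dimensions of the variables:

x ~ thousands of tons of coal per month,

y ~ millions of cubic metres of air per month.

We obtain

$$\frac{\text{thous. tons of coal/month} \times \text{mln m}^3 \text{ of air/month}}{\text{thous. tons of coal/month} \times \text{thous. tons of coal/month}} = \frac{\text{thous. m}^3 \text{ of air}}{\text{tons of coal}},$$

i.e.:

$$a_{21} = 0.156\ \frac{\text{thous. m}^3 \text{ of air}}{\text{tons of coal}} = 156\ \frac{\text{m}^3 \text{ of air}}{\text{tons of coal}}.$$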
TABLE 2
METHOD OF LEAST SQUARES APPLIED TO TABLE 1

 No      x       y    x-x̄    y-ȳ   (x-x̄)²   (y-ȳ)²   (x-x̄)(y-ȳ)
  1     114    18.5   +12   +1.0     144     1.00      +12.00
  2      98    17.1    -4   -0.4      16     0.16       +1.60
  3     118    18.4   +16   +0.9     256     0.81       +1.44
  4     104    17.6    +2   +0.1       4     0.01       +0.20
  5     103    17.2    +1   -0.3       1     0.09       -0.30
  6     104    18.0    +2   +0.5       4     0.25       +1.00
  7     104    18.5    +2   +1.0       4     1.00       +2.00
  8     108    18.8    +6   +1.3      36     1.69       +7.80
  9     113    19.4   +11   +1.9     121     3.61      +20.90
 10     107    18.4    +5   +0.9      25     0.81       +4.50
 11     105    17.4    +3   -0.1       9     0.01       -0.30
 12      93    15.5    -9   -2.0      81     4.00      +18.00
 13     102    18.1     0   +0.6       0     0.36        0
 14      93    15.4    -9   -2.1      81     4.41      +18.90
 15     100    17.9    -2   +0.4       4     0.16       -0.80
 16     102    17.6     0   +0.1       0     0.01        0
 17     104    18.1    +2   +0.6       4     0.36       +1.20
 18     104    18.5    +2   +1.0       4     1.00       +2.00
 19     104    19.5    +2   +2.0       4     4.00       +4.00
 20     108    19.2    +6   +1.7      36     2.89      +10.20
 21     101    17.3    -1   -0.2       1     0.04       +0.20
 22     103    18.5    +1   +1.0       1     1.00       +1.00
 23     110    19.1    +8   +1.6      64     2.56      +12.80
 24     106    18.1    +4   +0.6      16     0.36       +2.40
 25      96    16.2    -6   -1.3      36     1.69       +7.80
 26      94    15.2    -8   -2.3      64     5.29      +18.40
 27      97    15.4    -5   -2.1      25     4.41      +10.50
 28      85    14.3   -17   -3.2     289    10.24      +54.40
 29      84    14.5   -18   -3.0     324     9.00      +54.00
 30     103    17.0    +1   -0.5       1     0.25       -0.50
 31      97    16.2    -5   -1.3      25     1.69       +6.50
 32     101    17.0    -1   -0.5       1     0.25       +0.50
 33     104    18.5    +2   +1.0       4     1.00       +2.00
 34     102    17.8     0   +0.3       0     0.09        0
 35     107    18.3    +5   +0.8      25     0.64       +4.00
 36      94    17.5    -8    0        64     0           0

  Σ   3,672   630.0  +93    +19.3   1,774   65.14     +280.24
                     -93    -19.3                       -1.90
                                                       278.34

x̄ = 102,  ȳ = 17.5.
We calculate $b_{20}$:

$$b_{20} = \bar{y} - a_{21}\bar{x} = 17{,}500{,}000\ \text{m}^3 \text{ of air/month} - 156\ \frac{\text{m}^3 \text{ of air}}{\text{tons of coal}} \cdot 102{,}000\ \text{tons of coal/month}$$
$$= 17{,}500{,}000\ \text{m}^3 \text{ of air/month} - 15{,}912{,}000\ \text{m}^3 \text{ of air/month}$$
$$= 1{,}588{,}000\ \text{m}^3 \text{ of air/month}.$$

Thus the equation determining the average consumption
of air in relation to the extraction of coal has the following
form:

$$y = 1{,}588{,}000\ \text{m}^3 \text{ of air/month} + 156\ \frac{\text{m}^3 \text{ of air}}{\text{tons of coal}} \cdot x.$$

The equation may be called a characteristic of consumption 
of compressed air. For any data concerning production, pro- 
viding they are taken from the interval of validity of the func- 
tion, this equation provides the estimate of the average con- 
sumption of air for a given volume of production. The interval 
of validity of the function lies between the lowest and the 
highest value of the random variable appearing in the regres- 
sion line equation as an argument. In our example this in- 
terval is: 

(84,000 tons, 118,000 tons). 
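The use of the equation together with its interval of validity can be sketched as follows. This is a minimal illustration, assuming the parameters quoted above (1,588,000 m³ of air per month and 156 m³ of air per ton of coal); the names `normal_air_consumption` and `VALID_TONS` are ours.

```python
# Parameters quoted in the text for the regression of air consumption on output.
B20 = 1_588_000.0                   # intercept, m^3 of air per month
A21 = 156.0                         # slope, m^3 of air per ton of coal
VALID_TONS = (84_000.0, 118_000.0)  # interval of validity, tons per month


def normal_air_consumption(tons_per_month):
    """Average air consumption (m^3/month) for a given monthly output.

    The estimate is refused outside the interval of validity, because the
    line was fitted only to outputs observed inside that interval.
    """
    low, high = VALID_TONS
    if not low <= tons_per_month <= high:
        raise ValueError(
            f"{tons_per_month} tons lies outside the interval of validity "
            f"[{low}, {high}]"
        )
    return B20 + A21 * tons_per_month


print(normal_air_consumption(102_000))  # estimate at the mean monthly output
```

Raising an error outside the interval is one possible design choice; returning the extrapolated value with a warning would be another.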

The regression line is a geometric representation of the 
input consumption characteristic; after Falewicz, we shall call 
it the line of normal input consumption. 

To satisfy their needs, people have to produce various 
goods. 

"The labour process, resolved into its simple elementary 
factors, is as we have seen, purposive activity carried on for 
the production of use-values, for the fitting of natural 
substances to human wants; it is the general condition requi- 
site for effecting an exchange of matter between man and 
nature; it is the condition perennially imposed by nature
upon human life..." ([38], p. 177). Thus, labour is a tribute
that man pays to nature for her products. He tries for natural 
reasons to minimize this tribute; he tries to achieve his eco- 
nomic aims with the minimum of effort. This principle lies 
at the basis of all economic activities. 

We shall introduce the following definition of efficiency 
of the enterprise. By the efficiency we shall understand the 
totality of the activities of the enterprise aimed at the attain- 
ment of its economic objectives with the least outlay both in 
the form of "living" and "stored up" labour. The control 
of the efficiency of a socialist enterprise is one of the most 
important tasks of socialist economics. The normal input 
consumption line is an effective tool of such control. This line 
determines the average, the most probable, and thus the 
normal amount of input consumption corresponding to dif- 
ferent levels of production. If the consumption is lower than 
"normal" we can say that the enterprise has successfully 
raised its efficiency; if it is higher it means that the enter- 
prise has failed in its efforts to increase its efficiency 1 . 
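The comparison of actual with "normal" consumption described above might be sketched like this. The monthly figures in the example are hypothetical and serve only to show both outcomes; the function name `efficiency_report` is ours, and the line parameters are those quoted earlier in the text.

```python
def efficiency_report(observations, intercept, slope):
    """Compare actual input consumption with the normal consumption line.

    observations: iterable of (output, actual_consumption) pairs.
    Returns a list of (output, actual, normal, verdict) tuples.
    """
    report = []
    for output, actual in observations:
        normal = intercept + slope * output
        if actual < normal:
            verdict = "below normal"    # efficiency has been raised
        elif actual > normal:
            verdict = "above normal"    # efficiency has not been raised
        else:
            verdict = "normal"
        report.append((output, actual, normal, verdict))
    return report


# Hypothetical months: (tons of coal, m^3 of air actually consumed).
months = [(100_000, 16_900_000), (110_000, 19_100_000)]
for row in efficiency_report(months, intercept=1_588_000, slope=156):
    print(row)
```

In practice one would probably also want a tolerance band around the line, since small deviations are to be expected by chance; the sketch ignores this.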

Let us compute the values of the regression parameters of
X on Y. It follows from formulae (9) and (10) and from Table 2
that

$$a_{12} = \frac{278.34}{65.14} = 4.27;$$

dimension of $a_{12}$:

$$\frac{\text{mln m}^3/\text{month} \times \text{thous. tons/month}}{\text{mln m}^3/\text{month} \times \text{mln m}^3/\text{month}} = \frac{\text{thous. tons}}{\text{mln m}^3} = 0.001\ \frac{\text{tons}}{\text{m}^3}.$$

Hence

$$a_{12} = 4.27 \cdot 0.001\ \frac{\text{tons of coal}}{\text{m}^3 \text{ of air}} = 0.00427\ \frac{\text{tons of coal}}{\text{m}^3 \text{ of air}}.$$

In this case

$$b_{10} = 102{,}000\ \text{tons of coal/month} - 0.00427\ \frac{\text{tons of coal}}{\text{m}^3 \text{ of air}} \cdot 17{,}500{,}000\ \text{m}^3 \text{ of air/month}$$
$$= 27{,}275\ \text{tons of coal/month}.$$

1 An extensive discussion of applications of linear regression to the
economic control of an enterprise can be found in studies [23], [37].

The equation of the regression line which gives the average
production of coal in relation to the amount of compressed
air used, has the following form:

$$x = 27{,}275\ \text{tons of coal/month} + 0.00427\ \frac{\text{tons of coal}}{\text{m}^3 \text{ of air}} \cdot y.$$

The regression lines are shown on Graph 1. From the for- 
mal statistical point of view both regression lines are of equal 
importance but in an economic interpretation this is not so. 
The practical use of the regression line determining the most 
probable value of production corresponding to a given con- 
sumption of one production factor, is rather limited. On the 
other hand, there is great practical importance in an analysis 
of the relationship between the volume of production and 
several production factors. The method of multiple correla- 
tion is used for this type of analysis. 

The correlation coefficient is a measure of the degree of
correlation between the random variables studied. Since it
follows from our previous calculations that

$$a_{21} = 156\ \frac{\text{m}^3 \text{ of air}}{\text{tons of coal}}, \quad \text{and} \quad a_{12} = 0.00427\ \frac{\text{tons of coal}}{\text{m}^3 \text{ of air}},$$

then on the basis of formula (13)

$$r^2 = 0.00427 \cdot 156 = 0.66612.$$

Hence $r = 0.816$.

The correlation in this case is fairly strong.
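The step from the two slopes to the correlation coefficient is easily mechanized. The following is a small sketch of the relationship $r^2 = a_{21} a_{12}$ used above; the function name is ours, and the check that the product lies between 0 and 1 is an added safeguard, not part of the text.

```python
import math


def correlation_from_slopes(a21, a12):
    """Correlation coefficient from the two regression slopes, r^2 = a21 * a12.

    The slopes must be expressed in reciprocal units (here m^3 per ton and
    tons per m^3) so that their product is a pure number.
    """
    r_squared = a21 * a12
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("the product of the slopes must lie in [0, 1]")
    # Both slopes carry the sign of the covariance, and so does r.
    return math.copysign(math.sqrt(r_squared), a21)


r = correlation_from_slopes(156.0, 0.00427)
print(round(r, 3))
```

Note that mixing units (for instance 0.156 thous. m³ per ton with 0.00427 tons per m³) would give a product that is not dimensionless and therefore a meaningless result.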

Example 2. Analyse the relationship between the average 
level of inventories and the cost. 

A socialist enterprise is an independent economic entity 
(within the framework of accounting regulations) which tries to 
fulfil its production targets in the most rational and econom- 
ical way. This tendency manifests itself in efforts to fulfil 
and surpass production targets, to observe the production 
schedules, to improve the quality of the product, to econo- 
mize and to lower the cost of production. To fulfil these 
tasks the enterprise is equipped with an appropriate amount 
of capital goods and liquid assets. It is desirable that the 
amount of both capital goods and liquid assets necessary to 
carry out production targets be as low as possible. It is diffi- 
cult to realize this situation in practice. The amount of capital 
goods needed is determined by an analysis of the effective- 
ness of the investments. In the determination of the liquid 
assets requirements, however, a study of the liquid assets 
turnover is involved. The purpose of an analysis of the rela- 
tionship between the average stock of liquid assets and costs 
is to determine the parameters of the regression equation 
which enable us to assign an appropriate amount of average 
stock to particular costs. 

The scatter diagram on Graph 2 describes the relationship 
between the average level of stocks and costs in a clothing 
factory. The regression line shown on the graph expresses 
this relationship statistically. On the Y-axis average stocks are 
measured in millions of zlotys, on the X-axis costs in mil- 
lions of zlotys per quarter. The scatter diagram is based on 
the statistical data shown below, comprising a period of three
years. The data come from quarterly accounting reports. The
statistical material and the computations involved in deter- 
mining the regression parameters are shown in Table 2. 

TABLE 2

AVERAGE LEVEL OF STOCKS AND COSTS IN A CLOTHING
FACTORY

 No      x       y     x-x̄    y-ȳ   (x-x̄)²   (x-x̄)(y-ȳ)
  1     8.3    12.6   -5.4   -1.8    29.16      +9.72
  2    10.2    12.1   -3.5   -2.3    12.25      +8.05
  3    11.5    12.9   -2.2   -1.5     4.84      +3.30
  4    12.2    13.8   -1.5   -0.6     2.25      +0.90
  5    12.4    13.1   -1.3   -1.3     1.69      +1.69
  6    13.7    14.8    0     +0.4     0          0
  7    14.6    14.7   +0.9   +0.3     0.81      +0.27
  8    14.9    15.3   +1.2   +0.9     1.44      +1.08
  9    16.0    15.7   +2.3   +1.3     5.29      +2.99
 10    16.5    16.0   +2.8   +1.6     7.84      +4.48
 11    16.6    15.5   +2.9   +1.1     8.41      +3.19
 12    17.5    16.3   +3.8   +1.9    14.44      +7.22

  Σ   164.4   172.8  +13.9   +7.5    88.42      42.89
                     -13.9   -7.5

x = costs in millions of zlotys per quarter,
y = average stocks in millions of zlotys.

x̄ = 13.7,  ȳ = 14.4.

Let us calculate the parameters of the regression line of
Y on X:

$$a_{21} = \frac{42.89}{88.42} = 0.485, \qquad b_{20} = 14.4 - 0.485 \cdot 13.7 = 7.8.$$

The regression equation that we are trying to find has the
following form:

$$y = 7.8\ \text{million zlotys} + 0.485\ \text{quarters} \cdot x.$$

In this example we are not trying to determine the regression
line of X on Y or the correlation coefficient.

The regression line is a good tool for appraising the average
level of stocks. If the points corresponding to new reporting
data appear above the regression line, this means (assuming
that the points belong to the same population on which the
regression line is based) that there was a set-back in the
efforts to increase the turnover of liquid assets; conversely,
if the points are below the line, it indicates that the
turnover of liquid assets has risen.
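The appraisal just described can be sketched in a few lines. The slope and intercept are recomputed from the twelve quarterly pairs of the table rather than copied from the text; the two "new" reporting points in the usage lines are hypothetical, and the names `fit_y_on_x` and `appraise` are ours.

```python
# Quarterly data from the table of stocks and costs:
# x = costs (million zlotys per quarter), y = average stocks (million zlotys).
COSTS = [8.3, 10.2, 11.5, 12.2, 12.4, 13.7, 14.6, 14.9, 16.0, 16.5, 16.6, 17.5]
STOCKS = [12.6, 12.1, 12.9, 13.8, 13.1, 14.8, 14.7, 15.3, 15.7, 16.0, 15.5, 16.3]


def fit_y_on_x(xs, ys):
    """Least-squares slope and intercept of the regression line of Y on X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def appraise(cost, stock, slope, intercept):
    """Say whether a new (cost, stock) point lies above or below the line."""
    expected = intercept + slope * cost
    if stock > expected:
        return "above the line: set-back in turnover"
    if stock < expected:
        return "below the line: turnover has risen"
    return "on the line"


slope, intercept = fit_y_on_x(COSTS, STOCKS)
print(round(slope, 3), round(intercept, 2))
print(appraise(14.0, 16.0, slope, intercept))   # hypothetical new quarter
print(appraise(14.0, 13.0, slope, intercept))   # hypothetical new quarter
```

The verdict is only meaningful under the assumption stated in the text, namely that the new points belong to the same population as the data on which the line is based.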



GRAPH 2. Scatter diagram of average stocks (million zlotys,
Y-axis) against costs (million zlotys/quarter, X-axis), with
the regression line of Y on X.



Correlation analysis can be applied to other problems 
related to the analysis of liquid assets. By studying the rela- 
tionship between the volume of production in a given period 
of time and the average level of warehouse stocks we could 
determine whether such a relationship exists and what its 
degree is. This would provide valuable material for setting 
stock control norms and appraising the efficiency of the 
merchandise control department. Another yardstick for meas- 
uring its efficiency is provided by a study of the relation- 
ship between the flow of incoming and outgoing warehouse 
stocks measured in predetermined periods of time.

3.2.2. The technique of computing regression parameters
in a small and a large sample 1 . Contingency table

The computation process connected with the determination
of regression parameters in a small sample can be simplified
by replacing the differences $(x_i - \bar{x})$ and $(y_i - \bar{y})$
by $(x_i - u)$ and $(y_i - w)$, where u and w are certain
constants selected so as to facilitate computation.
Let us denote

$$\bar{x} = u + \Delta_u, \qquad \bar{y} = w + \Delta_w.$$

Since

$$\sum_{i=1}^{n} (x_i - u)^2 = \sum_{i=1}^{n} [x_i - (\bar{x} - \Delta_u)]^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n\Delta_u^2,$$

then

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i - u)^2 - n\Delta_u^2. \tag{1}$$

In consequence of similar transformations it is easy to show
that

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - w)^2 - n\Delta_w^2 \tag{2}$$

and

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - u)(y_i - w) - n\Delta_u \Delta_w. \tag{3}$$

Therefore

$$a_{21} = \frac{\sum_{i=1}^{n} (x_i - u)(y_i - w) - n\Delta_u \Delta_w}{\sum_{i=1}^{n} (x_i - u)^2 - n\Delta_u^2}, \tag{4}$$

$$a_{12} = \frac{\sum_{i=1}^{n} (x_i - u)(y_i - w) - n\Delta_u \Delta_w}{\sum_{i=1}^{n} (y_i - w)^2 - n\Delta_w^2}. \tag{5}$$

1 Samples comprising not more than 30 items we shall call small.
Other samples we shall call large.
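Identities (1), (2) and (3) can be verified numerically on any small sample. The sketch below compares sums taken about arbitrary origins u and w with sums taken directly about the means; the five pairs are used only as sample numbers, and both function names are ours.

```python
def centred_sums_via_origin(xs, ys, u, w):
    """Centred sums obtained from deviations about arbitrary origins u, w."""
    n = len(xs)
    delta_u = sum(xs) / n - u       # x_bar = u + delta_u
    delta_w = sum(ys) / n - w       # y_bar = w + delta_w
    sxx = sum((x - u) ** 2 for x in xs) - n * delta_u ** 2
    syy = sum((y - w) ** 2 for y in ys) - n * delta_w ** 2
    sxy = sum((x - u) * (y - w) for x, y in zip(xs, ys)) - n * delta_u * delta_w
    return sxx, syy, sxy


def centred_sums_direct(xs, ys):
    """The same three sums computed directly about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxx, syy, sxy


# Sample numbers for checking the identities only.
xs = [267, 254, 249, 344, 246]
ys = [141, 159, 112, 152, 119]
print(centred_sums_via_origin(xs, ys, u=280, w=130))
print(centred_sums_direct(xs, ys))
```

Whatever round values are chosen for u and w, the two lines of output should agree up to rounding error; that freedom of choice is what makes the device convenient for hand computation.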



When the sample is large the contingency table is often used
for computing regression parameters (Table 1). The top row
of the table contains the centres of class intervals of the
distribution of y's and in the left column the centres of class
intervals of the distribution of x's are shown. The bottom row
contains the frequencies $n_{\cdot 1}, n_{\cdot 2}, ..., n_{\cdot l}$
of the distribution of y's and the extreme right-hand column the
frequencies $n_{1\cdot}, n_{2\cdot}, ..., n_{k\cdot}$ of the
distribution of x's.

TABLE 1
CONTINGENCY TABLE

 x \ y    y_1    y_2   ...   y_j   ...   y_l     n_i.
  x_1     n_11   n_12  ...   n_1j  ...   n_1l    n_1.
  x_2     n_21   n_22  ...   n_2j  ...   n_2l    n_2.
  ...
  x_i     n_i1   n_i2  ...   n_ij  ...   n_il    n_i.
  ...
  x_k     n_k1   n_k2  ...   n_kj  ...   n_kl    n_k.
  n_.j    n_.1   n_.2  ...   n_.j  ...   n_.l    n

In the contingency table the frequency distribution in the
sample is shown. The number n in the extreme lower right-hand
panel denotes the size of the sample.

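Let us write down three important relationships following
directly from the contingency table:

$$n_{i\cdot} = \sum_j n_{ij}, \tag{6}$$

$$n_{\cdot j} = \sum_i n_{ij}, \tag{7}$$

$$n = \sum_i \sum_j n_{ij}. \tag{8}$$

Assuming that the values of the random variable X and of the
random variable Y which belong to particular classes of the
frequency distribution are equal to the central values of these
classes we get

$$\bar{x} = \frac{1}{n} \sum_i n_{i\cdot} x_i, \tag{9}$$

$$\bar{y} = \frac{1}{n} \sum_j n_{\cdot j} y_j, \tag{10}$$

$$a_{21} = \frac{\sum_i \sum_j n_{ij} (x_i - \bar{x})(y_j - \bar{y})}{\sum_i n_{i\cdot} (x_i - \bar{x})^2}, \tag{11}$$

$$a_{12} = \frac{\sum_i \sum_j n_{ij} (x_i - \bar{x})(y_j - \bar{y})}{\sum_j n_{\cdot j} (y_j - \bar{y})^2}. \tag{12}$$

(13)
(14)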

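Computations connected with the application of formulae
(11) and (12) may be simplified when the ranges of all the
intervals for each variable are the same. Instead of x and y
we then introduce the new variables

$$u = \frac{x - u'}{d}, \tag{15}$$

$$w = \frac{y - w'}{h}, \tag{16}$$

where u' and w' are constants, d is the range of the class
interval for the distribution of x's and h is the range of the
class interval for the distribution of y's. We have:

$$x - \bar{x} = u' + ud - u' - \bar{u}d = d(u - \bar{u}), \tag{17}$$

and similarly

$$y - \bar{y} = h(w - \bar{w}). \tag{18}$$

Introducing (17) and (18) in (11) and (12) we get

$$a_{21} = \frac{h \sum_i \sum_j n_{ij} (u_i - \bar{u})(w_j - \bar{w})}{d \sum_i n_{i\cdot} (u_i - \bar{u})^2}, \tag{19}$$

$$a_{12} = \frac{d \sum_i \sum_j n_{ij} (u_i - \bar{u})(w_j - \bar{w})}{h \sum_j n_{\cdot j} (w_j - \bar{w})^2}. \tag{20}$$

Now we have to take only one step to arrive at the formulae
which are needed to compute parameters $a_{21}$ and $a_{12}$ on
the basis of the data in the contingency table. After simple
transformations we have

$$a_{21} = \frac{h \left( \sum_i \sum_j n_{ij} u_i w_j - n \bar{u} \bar{w} \right)}{d \left( \sum_i n_{i\cdot} u_i^2 - n \bar{u}^2 \right)}, \tag{21}$$

$$a_{12} = \frac{d \left( \sum_i \sum_j n_{ij} u_i w_j - n \bar{u} \bar{w} \right)}{h \left( \sum_j n_{\cdot j} w_j^2 - n \bar{w}^2 \right)}. \tag{22}$$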



We shall illustrate the technique of calculating regression 
parameters by two examples. The first will show the calcula- 
tion of regression parameters from a small sample and the 
second from a large sample. 
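Before turning to the examples, the contingency-table computation with the coded variables u and w can be sketched in code. The 3 × 3 table below is invented for illustration and is not taken from the book; the function name and argument names are likewise ours. The slopes are formed as h/d and d/h times the ratios of coded sums, as in the formulae for $a_{21}$ and $a_{12}$ above.

```python
def slopes_from_contingency(table, x_mid, y_mid, u0, d, w0, h):
    """Regression slopes (a21, a12) from a contingency table via coded variables.

    table[i][j] is the frequency n_ij of x-class i together with y-class j;
    x_mid and y_mid hold the class centres; u = (x - u0) / d, w = (y - w0) / h.
    """
    n = sum(map(sum, table))
    u = [(x - u0) / d for x in x_mid]
    w = [(y - w0) / h for y in y_mid]
    row_tot = [sum(row) for row in table]           # n_i.
    col_tot = [sum(col) for col in zip(*table)]     # n_.j
    u_bar = sum(f * v for f, v in zip(row_tot, u)) / n
    w_bar = sum(f * v for f, v in zip(col_tot, w)) / n
    s_uw = sum(table[i][j] * u[i] * w[j]
               for i in range(len(u)) for j in range(len(w))) - n * u_bar * w_bar
    s_uu = sum(f * v * v for f, v in zip(row_tot, u)) - n * u_bar ** 2
    s_ww = sum(f * v * v for f, v in zip(col_tot, w)) - n * w_bar ** 2
    return (h / d) * s_uw / s_uu, (d / h) * s_uw / s_ww


# Hypothetical 3 x 3 table (invented frequencies).
table = [[4, 2, 0],
         [1, 5, 2],
         [0, 3, 3]]
a21, a12 = slopes_from_contingency(table, [10, 20, 30], [1, 3, 5],
                                   u0=20, d=10, w0=3, h=2)
print(a21, a12, a21 * a12)   # the last value is r squared
```

Choosing u0 and w0 near the centre of the table keeps the coded values small, which is the whole point of the device when the work is done by hand.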
Example 1. An analysis and comparison of welfare in
different countries is important and interesting, although dif- 
ficult. The difficulties arise because of differences in national 
mentalities, traditions, cultures and customs which cause 
such substantial differences in the average structure of wants 
in different countries that comparisons present a multitude 
of problems. The different price and wage ratios in the various 
countries studied and the necessity of rate of exchange com- 
putations magnify these difficulties. However, it is relatively 
easy to achieve at least some of the research objectives with 
the help of correlation analysis. The higher the level of wel- 
fare in the country the greater number of wants is included 
in the basic wants group (see 2.2.1.). The characteristic distin- 
guishing basic wants from others is the fact that the rela- 
tionship between the degree of satisfaction of these wants 
and income is rather weak. Food constitutes the most im- 
portant group of basic wants. In our further considerations 
we shall assume that the consumer in the country studied 
is able to find on the market every food product he demands. 
This assumption means that in all countries studied, the 
buying inducements to which the consumer is exposed with 
regard to food products are the same. Let us also assume 
that people in all countries if they had sufficient financial 
means at their disposal would satisfy their nutrition require- 
ments in such a way as to maximize their satisfaction. On 
the basis of these assumptions we can say that if incomes 
were sufficiently high people would satisfy their food require- 
ments in the best possible way, earmarking a sufficiently 
large portion of their income for this purpose. Once food
requirements are met in this optimum way, the portion of
income earmarked for food will not increase with a further
increase in income. This means that, other things being equal,
the correlation between the expenditures for food and the 
size of income becomes weaker as income increases.
We can surmise, therefore, that the correlation coefficient
between the level of food expenditures and the size of the con- 
sumer's income is one of the welfare characteristics of a group 
of people. Of course, there are other yardsticks for measuring 
welfare. 

TABLE 2

MONTHLY INCOME (x) AND EXPENDITURES (y) IN 20
FOUR-MEMBER LOWER SILESIAN FAMILIES

 No     x      y     x-u    y-w    (x-u)²   (y-w)²   (x-u)(y-w)
  1    267    141    -13    +11      169      121        -143
  2    254    159    -26    +29      676      841        -754
  3    249    112    -31    -18      961      324        +558
  4    344    152    +64    +22    4,096      484      +1,408
  5    246    119    -34    -11    1,156      121        +374
  6    411    207   +131    +77   17,161    5,929     +10,087
  7    217    114    -63    -16    3,969      256      +1,008
  8    219    118    -61    -12    3,721      144        +732
  9    359    152    +79    +22    6,241      484      +1,738
 10    378    150    +98    +20    9,604      400      +1,960
 11    256    135    -24     +5      576       25        -120
 12    406    160   +126    +30   15,876      900      +3,780
 13    258    117    -22    -13      484      169        +286
 14    213     84    -67    -46    4,489    2,116      +3,082
 15    345    129    +65     -1    4,225        1         -65
 16    273    164     -7    +34       49    1,156        -238
 17    251     76    -29    -54      841    2,916      +1,566
 18    225    126    -55     -4    3,025       16        +220
 19    254    149    -26    +19      676      361        -494
 20    194    113    -86    -17    7,396      289      +1,462

  Σ  5,619  2,677   +563   +269   85,391   17,053     +28,261
                    -544   -192                        -1,814
                                                       26,447

x̄ = 280.95,  u = 280,  Δ_u = 0.95,
ȳ = 133.85,  w = 130,  Δ_w = 3.85.
When the correlation coefficient is close to zero, food
requirements are met in an optimum way; when it approaches 
unity, the satisfaction of food requirements is so poor that 
the possibility of starvation cannot be excluded. However, 
when the correlation coefficient is close to zero or unity it 
ceases to perform its function as a yardstick of welfare. If, 
for instance, in two countries A and B the correlation coeffi- 
cient is close to zero we cannot say that the level of welfare 
is the same in both, but we can say that it is so high that it 
allows the citizens of both countries to achieve an optimum 
satisfaction of their food requirements. To compare the level 
of welfare in the two countries we have to introduce another 
measure which takes into consideration their unsatisfied needs. 
Similarly, we cannot contend that the level of welfare is equally 
low if, in the two countries studied, the coefficient of correla- 
tion between food expenditures and income is close to unity. 
We can say that the standard of living in both countries is low. 
To decide in which it is lower and in which it is higher we have 
to obtain additional information. 

The correlation coefficient should be used with care in meas- 
uring welfare. We should remember that we are measuring 
a complex phenomenon which depends upon many factors. 
If we heed this warning the correlation coefficient will be a use- 
ful tool for measuring the welfare of a nation. 
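In accordance with formulae (4) and (5) we have

$$a_{21} = \frac{26{,}447 - 20 \cdot 0.95 \cdot 3.85}{85{,}391 - 20 \cdot (0.95)^2} = \frac{26{,}374}{85{,}373},$$

$$a_{12} = \frac{26{,}374}{17{,}053 - 20 \cdot (3.85)^2} = \frac{26{,}374}{16{,}757},$$

$$r^2 = \frac{26{,}374^2}{85{,}373 \cdot 16{,}757},$$

$$r \approx 0.70.$$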
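The computation for the twenty families can be repeated as a check on the hand arithmetic. The sketch below applies formulae (4) and (5) with the working origins u = 280 and w = 130 shown in the table; the variable names are ours, and the data are the x and y columns of Table 2.

```python
import math

# Table 2: monthly income x and food expenditure y (tens of zlotys), 20 families.
INCOME = [267, 254, 249, 344, 246, 411, 217, 219, 359, 378,
          256, 406, 258, 213, 345, 273, 251, 225, 254, 194]
FOOD = [141, 159, 112, 152, 119, 207, 114, 118, 152, 150,
        135, 160, 117, 84, 129, 164, 76, 126, 149, 113]
U, W = 280, 130          # working origins used in the table

n = len(INCOME)
delta_u = sum(INCOME) / n - U
delta_w = sum(FOOD) / n - W
s_uu = sum((x - U) ** 2 for x in INCOME)
s_ww = sum((y - W) ** 2 for y in FOOD)
s_uw = sum((x - U) * (y - W) for x, y in zip(INCOME, FOOD))

numerator = s_uw - n * delta_u * delta_w
a21 = numerator / (s_uu - n * delta_u ** 2)    # formula (4)
a12 = numerator / (s_ww - n * delta_w ** 2)    # formula (5)
r = math.sqrt(a21 * a12)

print(s_uu, s_ww, s_uw)
print(round(a21, 3), round(a12, 3), round(r, 3))
```

The first line of output reproduces the three column totals of the table; the second gives the two slopes and the correlation coefficient.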
Column x in Table 2 shows the average monthly size of income
in tens of zlotys; column y of this table contains the statistical 
data on average monthly food expenditures, also in tens of 
zlotys. The statistics pertain to twenty four-member families 
living in the Lower Silesian area. 

With reference to this example let us make the following 
observation: although the calculation of the correlation coef- 
ficient is expedited by the method of least squares, it is a time- 
consuming operation. 

Example 2. One of the most difficult problems that con- 
fronts the clothing industry is that of deciding the range 
of sizes of clothes in order to ensure a good fit for 
a large number of people. Each size range is characterized 
by a set of several numbers. The problem consists in 
assigning these numbers to particular characteristics in such 
a way as to obtain an appropriate combination. Until 1955 
the clothing industry used to solve this problem in a very 
simple way: a well-proportioned woman and man would 
be selected as a typical representative of the majority of 
Poles, and ready-to-wear clothes were made according to 
their measurements. This method resulted in the production 
of clothes that could be worn only by a small number of 
people. Warehouses were overstocked with a large number 
of unsaleable products. In 1955 anthropologists and mathema- 
ticians were called in to help solve the problem. The anthro- 
pologists have taken about 85 thousand anthropometric pic- 
tures of men, women and children. The results were analysed 
by mathematicians under the direction of Professor Hugo 
Steinhaus. The sample was large enough to provide reliable 
information about the measurements of the whole population. 
In order to select an appropriate set of characteristics for the 
model the correlation was calculated for pairs of such
characteristics as: height, chest, waist, shoulder, neck, arm
measurements. The characteristics selected had a low degree of
correlation with one another and strong correlation with other
characteristics.

Table 3 is a contingency table containing statistical data 
and computations connected with the determination of the 
coefficient of correlation between the chest measurements and 
the height of 500 men selected at random from all the men 
included in the anthropometric studies 1 . 

The above example is a good illustration of how technolog- 
ical and economic problems are interrelated. 

The preparation of models is a technological problem but 
its consequences have an economic aspect. A poorly constructed 
model results in ill-fitting clothes which nobody wants to 
wear and consequently in the waste of thousands of metres 
of expensive material and in thousands of hours wasted by
tailors. Without correlation analysis it would be difficult to 
find proper measurements for the models. This shows how 
useful and valuable correlation analysis can be for practical 
purposes if it is skilfully used. 

The contingency table contains all the data necessary for
the computation of the correlation coefficient. Thus we have:

$$\bar{w} = \frac{8}{500} = 0.016, \qquad \bar{u} = \frac{87}{500} = 0.174,$$

the mean cross-product

$$\frac{634}{500} = 1.268,$$

and the two mean squares

$$\frac{1{,}232}{500} = 2.464, \qquad \frac{2{,}497}{500} = 4.994,$$

with

$$h = 2, \qquad d = 4.$$

Hence, on the basis of (21) and (22) we get:

$$a_{21} = \frac{2}{4} \cdot \frac{1.268 - 0.003}{2.464} = \frac{1.265}{4.928} = 0.257,$$

$$a_{12} = \frac{4}{2} \cdot \frac{1.268 - 0.003}{4.994} = \frac{2.530}{4.994} = 0.507.$$

Therefore

$$r^2 = 0.257 \cdot 0.507 = 0.130.$$

And finally

$$r = 0.361.$$

1 These statistics were obtained through courtesy of Professor Adam
Wanke.



3.3. Estimating linear regression parameters by the two-point 
method 

3.3.1. The derivation of formulae 

In section 3.1. we gave a brief justification for introducing 
into regression analysis a new method of estimating regres- 
sion parameters, which we called the two-point method. It is 
easy to master and convenient to use. We shall now describe 
this method. 

Let us denote by Ω, as in 3.2.1., a two-dimensional
general population. A pair of values (x, y) of random variable
(X, Y) corresponds to each item of this population. We assume
that the regression I lines of the general population are straight
lines, i.e.

$$\hat{y} = \alpha_{21} x + \beta_{20} \tag{1}$$

and

$$\hat{x} = \alpha_{12} y + \beta_{10}, \tag{2}$$

where $\alpha_{21}$, $\alpha_{12}$, $\beta_{20}$ and $\beta_{10}$
are regression parameters of the general population. We take
from this population a random sample ω comprising n items. We
get n pairs of numbers $(x_i, y_i)$ (i = 1, 2, ..., n)
corresponding to the items drawn. These numbers can be
interpreted as the coordinates of points located on a plane.
Such a random point corresponds to each item of population Ω.
We compute:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$

We divide set ω into two subgroups in such a way that we
include in the first subgroup the points with abscissae x not
greater than $\bar{x}$, and into the second subgroup all the
remaining points. If in the second subgroup there are k points,
then in the first there will be n − k points. Let us note that
in this division of set ω into two subgroups, quantity k is
a random variable which may assume the values 1, 2, ..., n − 1.
Let us denote by $x_{(1)}, y_{(1)}$ the coordinates of the points
of the first subgroup ($x \le \bar{x}$) and by $x_{(2)}, y_{(2)}$
the coordinates of the points of the second subgroup
($x > \bar{x}$).

We compute

$$\bar{x}_{(1)} = \frac{1}{n-k} \sum x_{(1)}, \qquad \bar{x}_{(2)} = \frac{1}{k} \sum x_{(2)},$$

$$\bar{y}_{(1)} = \frac{1}{n-k} \sum y_{(1)}, \qquad \bar{y}_{(2)} = \frac{1}{k} \sum y_{(2)}.$$

The following theorem can be proved:
THEOREM 1.

$$\begin{vmatrix} \bar{x}_{(1)} & \bar{y}_{(1)} & 1 \\ \bar{x}_{(2)} & \bar{y}_{(2)} & 1 \\ \bar{x} & \bar{y} & 1 \end{vmatrix} = 0. \tag{4}$$

The proof of this theorem is given on p. 213 of the Appendix
at the end of the book.

It follows from Theorem 1 that the three points
$(\bar{x}_{(1)}, \bar{y}_{(1)})$, $(\bar{x}_{(2)}, \bar{y}_{(2)})$,
$(\bar{x}, \bar{y})$ lie on one straight line. As the estimate
$a_{21}$ of parameter $\alpha_{21}$ we propose to accept the slope
of this line; it can be expressed by any one of the following
three formulae:

$$a_{21} = \frac{\bar{y}_{(2)} - \bar{y}_{(1)}}{\bar{x}_{(2)} - \bar{x}_{(1)}}, \tag{5}$$

$$a_{21} = \frac{\bar{y}_{(2)} - \bar{y}}{\bar{x}_{(2)} - \bar{x}}, \tag{6}$$

$$a_{21} = \frac{\bar{y} - \bar{y}_{(1)}}{\bar{x} - \bar{x}_{(1)}}. \tag{7}$$

An estimate of parameter $\beta_{20}$ can also be expressed in
one of the following three ways:

$$b_{20} = \bar{y}_{(1)} - a_{21} \bar{x}_{(1)}, \tag{8}$$

$$b_{20} = \bar{y}_{(2)} - a_{21} \bar{x}_{(2)}, \tag{9}$$

$$b_{20} = \bar{y} - a_{21} \bar{x}. \tag{10}$$

In order to obtain analogous formulae for estimates of
parameters $\alpha_{12}$ and $\beta_{10}$ we have to divide the
set of points ω into two subgroups in such a way that in the
first subgroup are points with ordinates y not greater than
$\bar{y}$, and in the second, all remaining points.
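Let us further denote by $x_{(1)}, y_{(1)}$ the coordinates of
the points of the first subgroup of this division
($y \le \bar{y}$) and by $x_{(2)}, y_{(2)}$ the coordinates of
the points of the second subgroup ($y > \bar{y}$).

Below are the definitions of other symbols:

$$\bar{y}_{(1)} = \frac{1}{n-m} \sum y_{(1)}, \qquad \bar{y}_{(2)} = \frac{1}{m} \sum y_{(2)},$$

$$\bar{x}_{(1)} = \frac{1}{n-m} \sum x_{(1)}, \qquad \bar{x}_{(2)} = \frac{1}{m} \sum x_{(2)}.$$

Letter m stands for the number of points which are in the
second subgroup as a result of the division of ω into two
subgroups. Of course, m is a random variable which may
assume the values 1, 2, ..., n − 1. By interchanging letters in
formulae (5)–(10) we get the formulae for the regression
parameters of X on Y:

$$a_{12} = \frac{\bar{x}_{(2)} - \bar{x}_{(1)}}{\bar{y}_{(2)} - \bar{y}_{(1)}}, \tag{12}$$

$$a_{12} = \frac{\bar{x}_{(2)} - \bar{x}}{\bar{y}_{(2)} - \bar{y}}, \tag{13}$$

$$a_{12} = \frac{\bar{x} - \bar{x}_{(1)}}{\bar{y} - \bar{y}_{(1)}}, \tag{14}$$

$$b_{10} = \bar{x}_{(1)} - a_{12} \bar{y}_{(1)}, \tag{15}$$

$$b_{10} = \bar{x}_{(2)} - a_{12} \bar{y}_{(2)}, \tag{16}$$

$$b_{10} = \bar{x} - a_{12} \bar{y}. \tag{17}$$

It follows from the definition of these regression parameters
that all that is required to determine the position of the
regression line by the two-point method is to know how to draw
a straight line through two points. When we want to determine
the position of the regression line of Y on X we draw a line
through any two of the three points
$(\bar{x}_{(1)}, \bar{y}_{(1)})$, $(\bar{x}_{(2)}, \bar{y}_{(2)})$,
$(\bar{x}, \bar{y})$. Similarly when we want to determine the
position of the regression line of X on Y we draw a line through
any two of the three points $(\bar{y}_{(1)}, \bar{x}_{(1)})$,
$(\bar{y}_{(2)}, \bar{x}_{(2)})$, $(\bar{y}, \bar{x})$, the
subgroup means now referring to the division by ordinates.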
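The two-point method for the regression line of Y on X can be sketched as follows. This is an illustration under stated assumptions rather than the author's own computation: the function name `two_point_fit` is ours, and the data are the 24 monthly pairs (car-kilometres, kWh) as they appear in Table 1 of section 3.3.2, used here unrounded.

```python
def two_point_fit(xs, ys):
    """Two-point estimates of the regression line of Y on X.

    The sample is split at x_bar; the line is drawn through the two
    subgroup mean points. Returns (a21, b20, p1, p2, p_mean).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    low = [(x, y) for x, y in zip(xs, ys) if x <= x_bar]   # first subgroup
    high = [(x, y) for x, y in zip(xs, ys) if x > x_bar]   # second subgroup
    if not low or not high:
        raise ValueError("both subgroups must contain at least one point")
    x1 = sum(x for x, _ in low) / len(low)
    y1 = sum(y for _, y in low) / len(low)
    x2 = sum(x for x, _ in high) / len(high)
    y2 = sum(y for _, y in high) / len(high)
    a21 = (y2 - y1) / (x2 - x1)        # slope through the two mean points
    b20 = y_bar - a21 * x_bar          # the line passes through (x_bar, y_bar)
    return a21, b20, (x1, y1), (x2, y2), (x_bar, y_bar)


# Car-kilometres driven (x) and kWh used (y), 24 months, from Table 1 of 3.3.2.
KM = [1162697, 1033608, 1093926, 1080507, 1209917, 1128658,
      1201090, 1215048, 1190704, 1242228, 1212823, 1252190,
      1327516, 1221159, 1372107, 1302451, 1401363, 1495300,
      1498257, 1503663, 1479019, 1575782, 1597701, 1617143]
KWH = [885999, 803399, 819134, 788863, 857770, 867890,
       917318, 953802, 955560, 996482, 865628, 882888,
       1055148, 961312, 1091060, 1056694, 1092946, 1094767,
       1060927, 1046036, 1258528, 1133920, 1184790, 1237667]

a21, b20, p1, p2, p_mean = two_point_fit(KM, KWH)
print(a21, b20)
```

Because the overall mean point is a weighted average of the two subgroup mean points, the slope obtained from any two of the three points is the same, which is what Theorem 1 asserts.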



3.3.2. The technique of computing regression parameters 
in a small and a large sample. Examples 

It is easy to use the formulae given in 3.3.1. We shall 
illustrate this by two examples. In the first example the statis- 
tical material covers a period of two years. The regression 
parameters have been calculated by two methods: by the 
method of least squares and by the two-point method. This 
will enable us to show the advantages of using the two-point 
method. Comparing the computation tables for the two meth- 
ods we can see that the two-point method is simpler than 
the classical, easier to comprehend and to compute. In 
Example 2 illustrating the determination of regression para- 
meters by the two-point method in a large sample, we shall 
not calculate these parameters by the method of least 
squares. However, to make possible in Example 2 a compari- 
son of the two methods and of the results obtained by them 
we shall use the statistical data of Example 2 from 3.2.2. 

Example 1. Table 1 contains data on the number of car- 
kilometres driven and the number of kWh used up by the 
cars of the City Transport Corporation in Wroclaw (monthly 
data). 
TABLE 1
KILOMETRES DRIVEN AND KWH USED BY ELECTRIC CARS
IN WROCLAW

 No        x           y         No        x           y
         car-km       kWh                car-km       kWh
  1    1,162,697     885,999     13    1,327,516   1,055,148
  2    1,033,608     803,399     14    1,221,159     961,312
  3    1,093,926     819,134     15    1,372,107   1,091,060
  4    1,080,507     788,863     16    1,302,451   1,056,694
  5    1,209,917     857,770     17    1,401,363   1,092,946
  6    1,128,658     867,890     18    1,495,300   1,094,767
  7    1,201,090     917,318     19    1,498,257   1,060,927
  8    1,215,048     953,802     20    1,503,663   1,046,036
  9    1,190,704     955,560     21    1,479,019   1,258,528
 10    1,242,228     996,482     22    1,575,782   1,133,920
 11    1,212,823     865,628     23    1,597,701   1,184,790
 12    1,252,190     882,888     24    1,617,143   1,237,667

We want to calculate the regression parameters by the 
method of least squares and by the two-point method: we 
shall start by rounding off the figures to the nearest ten 
thousand car-kilometres and ten thousand kWh. 

Below is shown the sequence of computing regression parameters
by the method of least squares:

TABLE 2
METHOD OF LEAST SQUARES APPLIED TO TABLE 1

(x in tens of thousands of car-km, y in tens of thousands of kWh)

No      x       y     x-u    y-w   (x-u)^2  (y-w)^2  (x-u)(y-w)
 1     116      89    -14    -11      196      121      154
 2     103      80    -27    -20      729      400      540
 3     109      82    -21    -18      441      324      378
 4     108      78    -22    -21      484      441      462
 5     121      86     -9    -14       81      196      126
 6     113      87    -17    -13      289      169      221
 7     120      92    -10     -8      100       64       80
 8     121      95     -9     -5       81       25       45
 9     119      96    -11     -4      121       16       44
10     124     100     -6      0       36        0        0
11     121      87     -9    -13       81      169      117
12     125      88     -5    -12       25      144       60
13     133     106     +3     +6        9       36       18
14     122      96     -8     -4       64       16       32
15     137     109     +7     +9       49       81       63
16     130     106      0     +6        0       36        0
17     140     109    +10     +9      100       81       90
18     150     109    +20     +9      400       81      180
19     150     106    +20     +6      400       36      120
20     150     105    +20     +5      400       25      100
21     148     126    +18    +26      324      676      468
22     158     113    +28    +13      784      169      364
23     160     118    +30    +18      900      324      540
24     162     124    +32    +24    1,024      576      768

Sum  3,140   2,387   +188   +131    7,118    4,206    4,970
                     -168   -143

$u = 130$, $w = 100$,
$a_{21} = 0.699$, $a_{12} = 1.180$,
$r^2 = 0.699 \cdot 1.180 = 0.83$,
$r = 0.91$.
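
The computation behind Table 2 can be retraced with a short sketch. It is only an illustration under stated assumptions: the function name `least_squares` is hypothetical, the data are the rounded x and y columns as printed in Table 2, and the centred sums are obtained from the sums about the provisional means u = 130 and w = 100 with the usual correction terms, so the last digits may differ slightly from the rounded figures printed under the table.

```python
# Rounded data as printed in the x and y columns of Table 2
# (tens of thousands of car-km and of kWh).
x = [116, 103, 109, 108, 121, 113, 120, 121, 119, 124, 121, 125,
     133, 122, 137, 130, 140, 150, 150, 150, 148, 158, 160, 162]
y = [89, 80, 82, 78, 86, 87, 92, 95, 96, 100, 87, 88,
     106, 96, 109, 106, 109, 109, 106, 105, 126, 113, 118, 124]


def least_squares(x, y, u=130, w=100):
    """Return (a21, a12, r) from deviations about provisional means u, w."""
    n = len(x)
    dx = [xi - u for xi in x]
    dy = [yi - w for yi in y]
    # Sums about the provisional means, corrected to sums about the true means.
    sxx = sum(d * d for d in dx) - sum(dx) ** 2 / n
    syy = sum(d * d for d in dy) - sum(dy) ** 2 / n
    sxy = sum(a * b for a, b in zip(dx, dy)) - sum(dx) * sum(dy) / n
    a21 = sxy / sxx          # regression of Y on X
    a12 = sxy / syy          # regression of X on Y
    r = (a21 * a12) ** 0.5   # a21 * a12 = r^2 for least-squares coefficients
    if sxy < 0:
        r = -r
    return a21, a12, r


a21, a12, r = least_squares(x, y)
print(round(a21, 3), round(a12, 3), round(r, 2))
```

The provisional means only shorten the hand arithmetic; any other choice of u and w gives the same coefficients once the correction terms are applied.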
In the following table the computation of regression parameters
by the two-point method is shown:

TABLE 3
TWO-POINT METHOD APPLIED TO TABLE 1

(* marks a value of x greater than the mean of x, v marks a value
of y greater than the mean of y)

No      x       y      x(2)    y<2>    y(2)    x<2>
 1     116      89
 2     103      80
 3     109      82
 4     108      78
 5     121      86
 6     113      87
 7     120      92
 8     121      95
 9     119      96
10     124     100v                    100     124
11     121      87
12     125      88
13     133*    106v    133     106     106     133
14     122      96
15     137*    109v    137     109     109     137
16     130     106v                    106     130
17     140*    109v    140     109     109     140
18     150*    109v    150     109     109     150
19     150*    106v    150     106     106     150
20     150*    105v    150     105     105     150
21     148*    126v    148     126     126     148
22     158*    113v    158     113     113     158
23     160*    118v    160     118     118     160
24     162*    124v    162     124     124     162

Sum  3,140   2,387   1,488   1,125   1,331   1,742

$\bar x = 130.8$, $\bar y = 99.5$,
$\bar x_{(2)} = 148.8$, $\bar y_{\langle 2\rangle} = 112.5$, $\bar y_{(2)} = 110.9$, $\bar x_{\langle 2\rangle} = 145.2$,
$a_{21} = 0.72$, $a_{12} = 1.26$,
$r^2 = 0.72 \cdot 1.26 = 0.91$,
$r = 0.95$.

Comparing Tables 2 and 3 we can easily see that the computations
connected with the determination of regression parameters
by the two-point method are much simpler and less
time-consuming than those required for the method of least
squares.

Let us explain the sequence of computations that have been 
made in order to fill in Table 3 and to find the values of re- 
gression parameters: 

1) the figures in columns x and y have been added and the
arithmetic means calculated:

$\bar x = \frac{3{,}140}{24} = 130.8$, $\bar y = \frac{2{,}387}{24} = 99.5$;

2) in column x the numbers greater than $\bar x \approx 131$ have
been marked with a *; there are ten of them;

3) the marked values of column x have been written down
in column $x_{(2)}$, and the corresponding values of y in
column $y_{\langle 2\rangle}$;

4) in column y the numbers greater than $\bar y = 99.5$ have
been marked with a v; there are twelve of them;

5) the marked values of y have been written down in column
$y_{(2)}$, and the corresponding values of x in column $x_{\langle 2\rangle}$;

6) the following averages have been calculated:

$\bar x_{(2)} = \frac{1{,}488}{10} = 148.8$, $\bar y_{\langle 2\rangle} = \frac{1{,}125}{10} = 112.5$,

$\bar y_{(2)} = \frac{1{,}331}{12} = 110.9$, $\bar x_{\langle 2\rangle} = \frac{1{,}742}{12} = 145.2$;

7) using formulae (5) and (12), $a_{21}$ and $a_{12}$ have been
calculated:

$a_{21} = \frac{112.5 - 99.5}{148.8 - 130.8} = 0.722$,

$a_{12} = \frac{145.2 - 130.8}{110.9 - 99.5} = 1.263$.

Knowing the estimates of the regression parameters we can
estimate the value of the correlation coefficient. We have:

$r^2 = a_{21} \cdot a_{12} = 0.722 \cdot 1.263 = 0.9119$.
Hence:

$r = 0.95$.
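
Steps 1 to 7 can be sketched in a few lines of code. This is a minimal illustration, not the author's own procedure: the names `two_point` and `mean` are hypothetical, the data are the rounded columns printed in Table 3, and the unrounded means are used throughout, so the coefficients agree with the hand computation only to about two decimal places.

```python
import math

# Rounded data as printed in the x and y columns of Table 3.
x = [116, 103, 109, 108, 121, 113, 120, 121, 119, 124, 121, 125,
     133, 122, 137, 130, 140, 150, 150, 150, 148, 158, 160, 162]
y = [89, 80, 82, 78, 86, 87, 92, 95, 96, 100, 87, 88,
     106, 96, 109, 106, 109, 109, 106, 105, 126, 113, 118, 124]


def mean(values):
    return sum(values) / len(values)


def two_point(x, y):
    """Two-point estimates (a21, a12) plus the sizes of the marked groups."""
    x_bar, y_bar = mean(x), mean(y)
    # Pairs whose x exceeds the mean of x (marked * in Table 3).
    upper_x = [(xi, yi) for xi, yi in zip(x, y) if xi > x_bar]
    # Pairs whose y exceeds the mean of y (marked v in Table 3).
    upper_y = [(xi, yi) for xi, yi in zip(x, y) if yi > y_bar]
    x2 = mean([xi for xi, _ in upper_x])        # mean of marked x
    y2_joint = mean([yi for _, yi in upper_x])  # mean of their joint y
    y2 = mean([yi for _, yi in upper_y])        # mean of marked y
    x2_joint = mean([xi for xi, _ in upper_y])  # mean of their joint x
    a21 = (y2_joint - y_bar) / (x2 - x_bar)
    a12 = (x2_joint - x_bar) / (y2 - y_bar)
    return a21, a12, len(upper_x), len(upper_y)


a21, a12, n_star, n_v = two_point(x, y)
r = math.sqrt(a21 * a12)
print(n_star, n_v, round(a21, 3), round(a12, 3), round(r, 2))
```

No squares or products of deviations are formed: apart from the final two divisions, the work consists of selecting values and averaging them.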

To fill in the computation table in the two-point method 
does not require any calculations. We simply write down in 
the appropriate columns the numbers marked * and v , and 
their corresponding "joint" 1 numbers. 

It is not necessary to subtract, square and multiply the 
numbers with different signs as was the case when the method 
of least squares was used. If the assumption about the linear 
character of correlation between variables X and Y is valid, 
both methods give approximately the same results, as can be 
seen from our example: 

GRAPH 1. [Scatter diagram with the regression lines; horizontal
axis: car-kilometres driven (tens of thousands).]

1 The word "joint" is used here in the following sense: since we deal
with a two-dimensional random variable, for every abscissa $x_i$ there is
a joint ordinate $y_i$, and vice versa, for every ordinate $y_i$ there is a joint
abscissa $x_i$.

On Graph 1 are shown two regression lines determined by
the method of least squares (broken lines), and two regression
lines determined by the two-point method (continuous lines).
There is very little difference in the location of the lines
obtained by the two methods, although the dispersion of points
is considerable.

Example 2. On the basis of the data contained in Table 3 
of 3.2.2., calculate by the two-point method the value both of 
the regression parameters and of the correlation coefficient. 

The solution of this problem is facilitated by Table 4. In 
addition to the symbols already denoted and defined there 
are some new ones: 

n <v> the frequency of variable X& 

~y 

**<%> 99 99 95 99 yl <2> 

#<!> 99 99 9-> 99 * 1> 

*'<2> 99 99 99 99 * <2> 

It follows from the general form of Theorem 1, 3.3.1. (see
note on p. 214) that set $\omega$ can be divided into two subgroups
not only by using numbers $\bar x$ and $\bar y$, but also by using any
numbers $x_1$ and $y_1$ that satisfy the inequalities

$x_{\min} \le x_1 \le x_{\max}$,
$y_{\min} \le y_1 \le y_{\max}$.

It was assumed in Table 4 that

$x_1 = 172$, $y_1 = 91$.

The division of set $\omega$ into subgroups was marked in the
table by two thick lines perpendicular to one another and
intersecting the middle part of the table cross-wise.

Let us compute the arithmetic means appearing in the
formulae for the regression coefficients. All the information
needed for calculating these averages is provided by Table 4.
Thus we have:

$\bar x_{(1)} = 172 - \frac{297}{314} \cdot 4 = 168.22$,

$\bar x_{(2)} = 172 + \frac{\ldots}{186} \cdot 4 = 178.52$,

$\bar y_{(1)} = 91 - \frac{391}{289} \cdot 2 = 88.30$,

$\bar y_{(2)} = 91 + \frac{\ldots}{211} \cdot 2 = 95.52$,

$\bar y_{\langle 1\rangle} = 91 - \frac{\ldots}{314} \cdot 2 = 90.32$,

$\bar y_{\langle 2\rangle} = 91 + \frac{193}{186} \cdot 2 = 93.08$,

$\bar x_{\langle 1\rangle} = 172 - \frac{\ldots}{289} \cdot 4 = 170.81$,

$\bar x_{\langle 2\rangle} = 172 + \frac{\ldots}{211} \cdot 4 = 173.40$.

Hence

$a_{21} = \frac{93.08 - 90.32}{178.52 - 168.22} = 0.267$,

$a_{12} = \frac{173.40 - 170.81}{95.52 - 88.30} = 0.359$,

$r^2 = 0.0958$,
$r = 0.309$.
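
The last step of Example 2 is a slope through two points of group means, which can be checked directly. The sketch below is illustrative only: the helper name `slope` is hypothetical, and the eight group means are the rounded values quoted above, so the third decimal place may differ from the printed coefficients.

```python
import math


def slope(run_low, run_high, rise_low, rise_high):
    """Slope of the line through two points of group means."""
    return (rise_high - rise_low) / (run_high - run_low)


# Groups formed by dividing on x at 172: means of x and of the joint y.
a21 = slope(168.22, 178.52, 90.32, 93.08)
# Groups formed by dividing on y at 91: means of y and of the joint x.
a12 = slope(88.30, 95.52, 170.81, 173.40)

r = math.sqrt(a21 * a12)
print(round(a21, 3), round(a12, 3), round(r, 3))
```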

On Graph 2 two pairs of regression lines are shown. The
broken lines are the regression lines determined by the method
of least squares (see 3.2.2., Example 2) and the continuous
lines are the regression lines determined by the two-point
method.

GRAPH 2.

As can be seen from the graph, the positions of the lines
determined by the two methods are not very different, in spite
of the fact that the correlation is fairly weak (i.e. the points
are widely scattered on the scatter diagram).



3.3.3. The properties of estimates obtained by the two-point 
method 

In both examples discussed above we have seen that the 
numerical results of estimating regression parameters by the 
method of least squares and the two-point method did not 
differ much. The computations in both cases were based on 
the actual statistical material so that there is no question of 
selecting the figures on purpose in such a way as to obtain 
similar results. The similarity can be explained by certain 
general properties of estimates obtained by both methods. 
As we know (see 3.1.), the estimates of regression parameters
obtained by the classical method are consistent and unbiased.
Estimates obtained by the two-point method have similar prop- 
erties. This explains the similarity of the results noticeable 
in Examples 1 and 2 of 3.3.2. 

We shall give below theorems concerning the most important
properties of the regression coefficient $a_{21}$ defined in
3.3.1. by one of the formulae (5), (6) and (7). These theorems
can also be adapted to apply to the regression coefficient $a_{12}$.

THEOREM 1. Regression coefficient $a_{21}$ is a consistent estimate
of regression coefficient $\alpha_{21}$ for the general population
$\Omega$, i.e. for every $\varepsilon > 0$

$\lim_{n \to \infty} P(|a_{21} - \alpha_{21}| > \varepsilon) = 0$. (1)

THEOREM 2. Regression coefficient $a_{21}$ is an unbiased estimate
of regression coefficient $\alpha_{21}$, i.e.

$E(a_{21}) = \alpha_{21}$. (2)

In order to appraise the effectiveness of estimates obtained 
by the two-point method we should compare the variance 
of these estimates with the variance of the estimates obtained 
by the method of least squares. We know from the Markoff 
Theorem that the estimates obtained by the classical method 
have a minimum variance. To distinguish between estimates 
obtained by the two methods we shall denote them as follows : 
$a_{21\,\mathrm{class}}$ - the regression coefficient obtained by the method
of least squares,
$a_{21\,\mathrm{point}}$ - the regression coefficient obtained by the
two-point method.
Let us also denote: 

class) = Ffatt class I *l = w l>-> X n = *X 
Point) = P(a point I *1 = b-* X n = n ), 



where $u_1, u_2, \ldots, u_n$ are certain constants.
THEOREM 3.

Kflil point I X I = W l -> *n = w n) 
= Z^'l^ ^ !%L . JLli. (3) 

^i point) ^ s l n 

Applying the Slutsky Theorem to the right side of formula
(3), we find that $e$ converges in probability to $D_1^2/\sigma_1^2$, where

$D_1 = \int_{-\infty}^{\infty} |x - m_1| f_1(x)\,dx$, $\sigma_1^2 = \int_{-\infty}^{\infty} (x - m_1)^2 f_1(x)\,dx$.

The definition of symbol $f_1(x)$ is given in 1.2.3., formula (6).
If the distribution of population $\Omega$ is normal (see 1.2.9.,
formula (1)), then $e$ converges in probability to $2/\pi$. It is
interesting to note that if the random variable X has a normal
distribution $N(m, \sigma)$ then the effectiveness of the median
as an estimate of parameter m is also equal to $2/\pi$.
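
The limiting value $2/\pi$ for a normal population can be checked numerically. The sketch below is an illustration only: it approximates the two integrals $D_1$ and $\sigma_1^2$ by a plain midpoint rule, and the function names, the integration range of ten standard deviations on either side of the mean and the number of steps are all assumptions made for the example.

```python
import math


def normal_pdf(x, m=0.0, sigma=1.0):
    """Density of the normal distribution N(m, sigma)."""
    z = (x - m) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def efficiency_limit(m=0.0, sigma=1.0, half_width=10.0, steps=20000):
    """Midpoint-rule approximation of D1^2 / sigma1^2 for N(m, sigma)."""
    start = m - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    d1 = 0.0        # approximates the integral of |x - m| f(x) dx
    variance = 0.0  # approximates the integral of (x - m)^2 f(x) dx
    for k in range(steps):
        x = start + (k + 0.5) * h
        f = normal_pdf(x, m, sigma)
        d1 += abs(x - m) * f * h
        variance += (x - m) ** 2 * f * h
    return d1 * d1 / variance


print(efficiency_limit(), 2.0 / math.pi)
```

For a normal density the mean absolute deviation is $\sigma\sqrt{2/\pi}$, which is why the ratio does not depend on $m$ or $\sigma$.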
THEOREM 4. The distribution of random variable

(4)

approaches a normal distribution $N(0, 1)$ for $n \to \infty$. The
proofs of Theorems 1-4 are given in [31].

3.3.4. Comments on estimating the correlation coefficient 
by the two-point method 

In examples 1 and 2 we have estimated correlation coefficient
$\rho$ in population $\Omega$ by the formula

$r^2 = a_{21} \cdot a_{12}$,

where $a_{21}$ and $a_{12}$ are regression coefficients obtained by the
two-point method. This procedure is justified by the Slutsky
Theorem which states that if random variables $X_n, Y_n, \ldots, Z_n$
are stochastically convergent to the constants $x, y, \ldots, z$, then
any rational function of these variables $R(X_n, \ldots, Z_n)$ is
stochastically convergent to the constant $R(x, y, \ldots, z)$.
To prove that

$r = \sqrt{a_{21} \cdot a_{12}}$

tends in probability to correlation coefficient $\rho$ we shall prove
the following theorem.

THEOREM 1. Let the sequence $\{X_n\}$ of random variables
tend with probability 1 to the number $a$, and let $\psi(x)$ denote
a continuous function of $x$. Then the sequence $\{\psi(X_n)\}$ of
random variables converges with probability 1 to $\psi(a)$.

Proof. By assumption

$P(\lim_{n \to \infty} X_n = a) = 1$.

For each continuous function $\psi(x)$, the condition

$\lim_{n \to \infty} X_n = a$

means that

$\lim_{n \to \infty} \psi(X_n) = \psi(a)$.

Hence the event $\lim X_n = a$ entails the event $\lim \psi(X_n) = \psi(a)$,
and therefore

$P(\lim_{n \to \infty} \psi(X_n) = \psi(a)) = 1$.

It follows from the above theorem that if $r^2$ converges in
probability to $\rho^2$, then $r$ tends to $\rho$.

In the conclusion of our discussion on estimating the correlation
coefficient by the two-point method we should mention
one more problem. As we know, the product of the
regression coefficients obtained by the method of least squares

$a_{21\,\mathrm{class}} \cdot a_{12\,\mathrm{class}}$

is a positive quantity with zero-one norm (see 1.2.8.). The
product of the regression coefficients obtained by the two-point
method does not have this property.
We can give numerical examples in which the product

$a_{21\,\mathrm{point}} \cdot a_{12\,\mathrm{point}}$

is greater than one or less than zero. If the sample is sufficiently
large then, on the assumption that there is linear
correlation between the variables studied, the probability of
the realization of such an event is negligible because the
product $a_{21\,\mathrm{point}} \cdot a_{12\,\mathrm{point}}$ tends stochastically to $\rho^2$.

In estimating the correlation coefficient by the two-point 
method we should observe the following convention: 

1) correlation coefficient $r > 0$ when $a_{21} > 0$ and $a_{12} > 0$;

2) correlation coefficient $r < 0$ when $a_{21} < 0$ and $a_{12} < 0$;

3) if $a_{21} \cdot a_{12} > 1$ we assume that $r = 1$;

4) if coefficients $a_{21}$ and $a_{12}$ have different signs we assume
that $r = 0$.

When case 3 or 4 occurs in practice we suspect that the
assumption about the linearity of correlation is not true. We
also suspect this when the product $a_{21} \cdot a_{12} = 1$ but the points
on the scatter diagram are not located on the straight line
(see Graph 3).

It follows that there are situations in which the two-point
method provides reasons for postulating the hypothesis that
at least one of the regression lines is not a straight line. This
is an important advantage of the two-point method.
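
The four-point convention for the sign and size of r can be written as a small function. This is a hedged sketch: the name `correlation_from_two_point` is hypothetical, and for case 3 the sketch assumes that the sign of r follows the common sign of the two coefficients, which the convention as stated (r = 1) leaves open when both coefficients are negative.

```python
import math


def correlation_from_two_point(a21, a12):
    """Estimate r from two-point regression coefficients a21 and a12."""
    product = a21 * a12
    if product <= 0:
        # Case 4: different signs (or a zero coefficient) give r = 0.
        return 0.0
    # Case 3: a product above one is cut back to |r| = 1.
    magnitude = 1.0 if product > 1 else math.sqrt(product)
    # Cases 1 and 2: r takes the common sign of the coefficients.
    # Assumption: the same sign rule is applied in case 3.
    return magnitude if a21 > 0 else -magnitude


print(correlation_from_two_point(0.722, 1.263))   # Example 1 of 3.3.2.
print(correlation_from_two_point(2.0, 0.8))       # case 3
print(correlation_from_two_point(0.5, -0.3))      # case 4
```

A caller who wants to act on the warning attached to cases 3 and 4 would have to test the product separately, since the returned value alone no longer shows which case occurred.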



4. ON TESTING CERTAIN STATISTICAL 
HYPOTHESES 

4.1. Two tests to verify the hypothesis that the distribution 
of the general population Q is normal 1 

4.1.1. The formulation of the problem 

In many theorems of the theory of probability and mathematical
statistics it is assumed that the distribution of the random
variable is normal. In practical applications it is often very 
difficult to check to what extent this assumption is justified. 
When the subject of statistical research is an ordinary ran- 
dom variable, there are several methods of testing the 
hypothesis that the distribution of the population is normal. 
These methods do not provide sufficient grounds for accept- 
ing the hypothesis, but in some cases enable us to reject it. 
The usual procedure in practice is to assume that if the 
information obtained from the sample does not give grounds 
for rejecting the hypothesis, it can be regarded as true and 
can be accepted in the sense that the population is normal, 
without any further justification. Although this procedure is 
open to objection it has to be accepted because there is no 
other sensible way out. 

1 Published in Przegląd Statystyczny (Statistical Review), No. 3, 1957.

For statistical research involving multi-dimensional random
variables, it is more difficult to test the hypothesis that the
population is normal. We shall again be concerned here with
continuous two-dimensional variables. As we know (e.g. see
[7], section 29.6) in many theorems involving such a variable
it is assumed that its distribution is normal, i.e. that the
two-dimensional density function of this distribution is expressed
by the formula

$\varphi(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x-m_1)^2}{\sigma_1^2} - \frac{2\rho(x-m_1)(y-m_2)}{\sigma_1\sigma_2} + \frac{(y-m_2)^2}{\sigma_2^2}\right]\right\}$ (1)

(see 1.2.9., formula (1)).

Let us denote by H the hypothesis that the two-dimensional
random variable (X, Y) has a normal distribution with the density
given by (1).

It follows from the generalization of the Central Limit Theorem
to two-dimensional variables (see [7]) that in practice we
often deal with this type of distribution. It is difficult to es- 
tablish this fact by experiment because it is inconvenient 
to construct spatial diagrams of the distribution. This in- 
creases the importance of statistical methods in testing hy- 
pothesis H. 

The role of these methods is particularly important in select- 
ing a function for the equation of the regression line. As we 
know (see 1.2.9., Corollary 1) regression I lines are straight
lines when the joint distribution of random variable (X, Y)
is normal. This explains why we so often deal with linear 
regression in practice. However, a distribution does not neces- 
sarily have to be normal every time a visual inspection of the 
scatter diagram based on a sample suggests that we deal with 
linear regression. When a population is normal the regression 
lines are straight lines, but when the regression lines are 
straight lines the distribution of the population may or may 
not be normal. Under these circumstances the results of 
testing the hypothesis that the distribution is normal may be 
of great practical importance.

The difficulties encountered in verifying this hypothesis
with reference to a two-dimensional random variable are 
caused by the fact that the tables of the two-dimensional 
density function for a normal distribution are not easily acces- 
sible 1 . In this section we shall discuss two methods of testing 
the hypothesis that the distribution of a two-dimensional 
random variable is normal. When these methods are used 
such tables are not required. 

Both methods can be applied to large samples. 

4.1.2. Testing hypothesis H by rotating the coordinate
system (method A)

The consistency of the two-dimensional distribution of a
general population with a normal distribution can be checked
by the $\chi^2$ test. As we know (1.2.9., Theorem 1), variables X
and Y are stochastically independent when parameter $\rho$ in a
two-dimensional normal distribution equals zero. If parameter
$\rho \ne 0$ we replace random variables X and Y by variables
X' and Y' using a linear transformation

$X' = (X - m_1)\cos\theta + (Y - m_2)\sin\theta$,

$Y' = -(X - m_1)\sin\theta + (Y - m_2)\cos\theta$, (1)

where $\theta$ is the angle of rotation given by formula (7) in 1.2.8.
Then

$E(X'Y') = 0$, and hence $\rho(X', Y') = 0$

(see 1.2.8., Theorem 4).

1 However, such tables exist. See [40], [44].

Thus we can see that if the joint distribution of a two-dimensional
random variable (X', Y') is normal, then these variables
are stochastically independent. Hence we can write

$\varphi(x', y') = \varphi_1(x')\,\varphi_2(y')$,

where $\varphi(x', y')$ denotes the two-dimensional density of the
normal distribution of variable (X', Y') and $\varphi_1(x')$ and $\varphi_2(y')$
are the symbols of marginal densities of variables X' and Y'
in this distribution. Since the joint distribution is normal, the
distributions $\varphi_1(x')$ and $\varphi_2(y')$ are also normal but, of course,
in one dimension 1 . The parameters of these distributions
are expressed by the formulae:

$E(X') = E[(X - m_1)\cos\theta + (Y - m_2)\sin\theta] = 0$,

$E[X' - E(X')]^2 = E[(X - m_1)^2\cos^2\theta + (Y - m_2)^2\sin^2\theta + (X - m_1)(Y - m_2)\sin 2\theta] = \sigma_1^2\cos^2\theta + \sigma_2^2\sin^2\theta + \rho\sigma_1\sigma_2\sin 2\theta$,

$E(Y') = E[-(X - m_1)\sin\theta + (Y - m_2)\cos\theta] = 0$,

$E[Y' - E(Y')]^2 = E[(X - m_1)^2\sin^2\theta + (Y - m_2)^2\cos^2\theta - (X - m_1)(Y - m_2)\sin 2\theta] = \sigma_1^2\sin^2\theta + \sigma_2^2\cos^2\theta - \rho\sigma_1\sigma_2\sin 2\theta$.

The construction of the test for hypothesis H is based on 
the fact that variables X' and Y' are stochastically independent. 
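
The effect of the rotation can be seen on a handful of sample points. The sketch below is an illustration under stated assumptions: the function name `rotate_to_uncorrelated` is hypothetical, population parameters are replaced by sample values, and the angle is taken from $\tan 2\theta = 2 s_{xy}/(s_{xx} - s_{yy})$, which is the form used in the worked example of this section (formula (7) of 1.2.8. itself is not reproduced here). `math.atan2` is used instead of a plain arctangent so that the case $s_{xx} = s_{yy}$ needs no special treatment.

```python
import math


def rotate_to_uncorrelated(xs, ys):
    """Translate to the sample means and rotate so that x', y' are uncorrelated."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Rotation angle from tan(2*theta) = 2*sxy / (sxx - syy).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    x_new = [(x - mx) * c + (y - my) * s for x, y in zip(xs, ys)]
    y_new = [-(x - mx) * s + (y - my) * c for x, y in zip(xs, ys)]
    return theta, x_new, y_new


# A few (x, y) pairs chosen only for illustration.
xs = [183, 184, 180, 164, 177, 159]
ys = [175, 172, 168, 156, 190, 160]
theta, x_new, y_new = rotate_to_uncorrelated(xs, ys)
cross = sum(a * b for a, b in zip(x_new, y_new))
print(round(math.degrees(theta), 2), abs(cross) < 1e-8)
```

The sum of products of the new coordinates vanishes up to rounding error, which is the sample counterpart of $E(X'Y') = 0$.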

By using the $\chi^2$ test we can easily check whether or not
empirical marginal distributions are essentially different from
a normal distribution. Let us denote by $\nu_1(x')$ the empirical
marginal distribution of variable X' from the sample and by
n the size of the sample. In this case the divergence between
this empirical distribution of variable X' from the sample and
a theoretical normal distribution $\varphi_1(x')$ is measured by the
expression
2 _ 





For variable Y' we get an analogous expression 
V 2_ 

Ay 



where $\nu_2(y')$ denotes the empirical distribution of variable Y'.

1 Tables for a normal distribution in one dimension are given at the
end of the book.

We shall denote by $\alpha$ the probability of an event that we
consider as practically impossible. There exists a positive
real number $\chi_\alpha^2$ dependent upon $\alpha$ and such that

$P(\chi^2 > \chi_\alpha^2) = \alpha$.

We have further

$P(\chi_x^2 > \chi_\alpha^2) = \alpha$

and analogously

$P(\chi_y^2 > \chi_\alpha^2) = \alpha$.

It should be remembered that when hypothesis H is true,
variables X' and Y' are independent, and so are variables $\chi_x^2$
and $\chi_y^2$.
We reject hypothesis H when

$(\chi_x^2 > \chi_\alpha^2) \cup (\chi_y^2 > \chi_\alpha^2)$, (4)

where symbol $\cup$ means "or".
The probability of this event is

$P[(\chi_x^2 > \chi_\alpha^2) \cup (\chi_y^2 > \chi_\alpha^2)] = 1 - (1 - \alpha)^2 = 1 - 1 + 2\alpha - \alpha^2 = 2\alpha - \alpha^2 < 2\alpha$.

The values of $\chi_\alpha^2$ are taken from appropriate tables 1 . We
call the number $\alpha$ the level of significance.
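
The bound on the probability of rejecting a true hypothesis is easy to evaluate for a given level. A minimal sketch, with the hypothetical function name `rejection_probability` and under the independence of the two statistics stated above:

```python
def rejection_probability(alpha):
    """Probability that at least one of two independent statistics
    exceeds its critical value, each with probability alpha."""
    return 1.0 - (1.0 - alpha) ** 2


alpha = 0.01
p = rejection_probability(alpha)
print(p, 2 * alpha)
```

For small levels the term $\alpha^2$ is negligible, so the overall level of the combined test is very nearly $2\alpha$.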

Example 1 2 . In an electric power station the relationship
between the consumption of coal and the output of electric
power is studied. In Table 1 the statistical material for a 6-year
period is shown. The x-column of this table represents the
monthly output of electric current (in tens of thousands of
kWh) measured at the generator contacts and the y-column
contains the data on the consumption of slack coal (in tens of
tons) used for production. The data come from a metallurgical
electric power station equipped with two SKODA turbogenerators,
each of 3,100 kW capacity, and two DUQUESNE
boilers heated by slack coal. The consumption of slack coal
is given in gross terms as recorded during control weighing.
The output of power is also given in gross terms because it
was measured at the contacts of the generators.

1 $\chi^2$ distribution tables are given at the end of the book.

2 The statistical data for this example have been obtained through
the courtesy of Professor J. Falewicz.

TABLE 1

CONSUMPTION OF COAL AND ELECTRICITY GENERATED IN AN
ELECTRIC POWER STATION

No    x    y     No    x    y     No    x    y     No    x    y
 1   183  175    19   190  193    37   180  191    55   180  164
 2   184  172    20   180  169    38   170  167    56   207  191
 3   180  168    21   169  177    39   182  161    57   190  173
 4   164  156    22   206  190    40   189  180    58   185  174
 5   177  190    23   199  186    41   213  191    59   186  190
 6   159  160    24   201  180    42   301  264    60   181  166
 7   147  142    25   207  182    43   225  202    61   192  179
 8   151  153    26   209  190    44   234  214    62   203  191
 9   164  149    27   184  164    45   203  184    63   277  247
10   122  128    28   165  164    46   192  189    64   299  257
11   167  167    29   142  149    47   191  179    65   215  206
12   188  172    30   116  133    48   146  155    66   200  188
13   180  162    31   147  164    49   193  173    67   192  167
14   156  160    32   175  168    50   187  166    68   187  183
15   163  154    33   197  176    51   187  182    69   194  190
16   175  160    34   202  186    52   212  205    70   194  182
17   173  179    35   189  183    53   251  216    71   190  178
18   158  144    36   190  176    54   220  201    72   278  241

A scatter diagram on the basis of the data given in Table 1
is shown on Graph 1.

GRAPH 1.

The distribution of the points on the diagram suggests that
the distribution of the variable (X, Y) is normal. This hypothesis
should be tested. To prepare the statistical material for
testing the hypothesis we use formulae for translating and
rotating the coordinate system. We find the angle of rotation
$\gamma$ by formula (7) from 1.2.8., replacing the population
parameters by the sample parameters. In this way we test the
hypothesis that the population has a normal distribution
with parameters $m_1 = \bar x$, $m_2 = \bar y$, $\gamma$.
Let us denote:

$\bar x = \frac{1}{n}\sum x$, $\bar y = \frac{1}{n}\sum y$, n - the size of the sample.

It follows from calculations (see Appendix, pp. 215-216),
that

$\sum (x - \bar x)(y - \bar y) = 56{,}344$,
$\sum (x - \bar x)^2 = 79{,}124$,
$\sum (y - \bar y)^2 = 43{,}952$,

$\tan 2\gamma = \frac{112{,}688}{35{,}176} = 3.20$.

Further, we have

$\gamma = 36^\circ 21'$,
$\sin\gamma = 0.592$,
$\cos\gamma = 0.806$.

After a linear transformation, according to formula (1) in
4.1.2., we obtain the values of the new variables x', y' which
are given in Table 2.
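
The angle of rotation can be recomputed from the three sums quoted above. This is only a sketch: the variable names are hypothetical and the denominator is formed directly as the difference of the two printed sums of squares (35,172 rather than the printed 35,176), which leaves the quotient at 3.20 to two decimal places and the angle within about a minute of arc of the value in the text.

```python
import math

# Sums quoted in the text (from the Appendix).
s_xy = 56344   # sum of (x - mean x)(y - mean y)
s_xx = 79124   # sum of (x - mean x)^2
s_yy = 43952   # sum of (y - mean y)^2

tan_2gamma = 2 * s_xy / (s_xx - s_yy)
gamma = 0.5 * math.atan(tan_2gamma)

degrees = math.degrees(gamma)
whole_degrees = int(degrees)
minutes = (degrees - whole_degrees) * 60

print(round(tan_2gamma, 2))
print(whole_degrees, round(minutes))
print(round(math.sin(gamma), 3), round(math.cos(gamma), 3))
```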

TABLE 2
TRANSFORMATION COMPUTATIONS APPLIED TO TABLE 1

 i     x'      y'      i     x'      y'      i     x'      y'      i      x'      y'
 1    -8.0    +0.9    19    +8.3   +11.3    37    -1.0   +15.6    55    -16.9    -6.2
 2    -9.6    -2.9    20   -14.0    -2.1    38   -23.2    +2.2    56    +20.8    -0.4
 3   -14.6    -3.0    21    +3.7    +4.2    39   -17.1    -9.8    57     -3.6    -4.8
 4   -34.6    -3.2    22   +19.4    -1.0    40    -0.2    +1.4    58     -7.0    -2.1
 5    -4.0   +16.6    23   +11.4    +0.3    41   +25.6    -4.0    59     -8.6    -4.9
 6   -36.2    +3.0    24    +9.5    -5.8    42   +13.8    +2.9    60    -15.0    -5.2
 7   -56.6    -4.0    25   +14.5    -7.6    43   +41.8    -2.2    61     +1.6    -1.2
 8   -46.8    +2.1    26   +21.8    -2.4    44   +56.2    +2.2    62    +17.6    +2.0
 9   -38.7    -8.8    27   -13.6    -8.5    45   +23.4    -3.7    63   +110.4    +3.3
10   -85.0    -0.9    28   -29.0    +2.7    46    +7.5    +6.9    64   +134.0    -1.7
11   -25.6    +4.0    29   -56.5    +4.2    47    +0.8    -0.6    65    +36.1    +7.0
12    -5.8    -4.7    30   -86.9    +6.7    48   -49.7    +6.7    66    +13.4    +1.3
13   -18.1    -7.8    31   -43.5   +12.4    49    -1.1    -6.6    67     -5.5   -10.9
14   -38.7    +4.8    32   -18.6     0.0    50   -10.1    -8.7    68     -0.1    +5.0
15   -36.6    -4.2    33    +3.9    -6.6    51    -0.6    +3.2    69     +9.7    +6.6
16   -23.3    -5.4    34   +13.8    -1.5    52   +33.1    +7.9    70     +4.9    -0.1
17   -13.7   +10.1    35    +1.6    +3.8    53   +71.1    -6.3    71     -0.6    -0.8
18   -46.5    -9.3    36    -1.8    -2.4    54   +37.2     0.0    72   +107.6    -2.1

The scatter diagram for the data in Table 2 is shown below
(Graph 2).

GRAPH 2.

On the basis of the data from Table 2 we now construct a
frequency distribution for variable x' and for variable y'.
Then on the basis of formulae (2) and (3) we calculate $\chi_x^2$ and
$\chi_y^2$ (see Table 3 and Table 4).

TABLE 3
$\chi^2$ TEST APPLIED TO VARIABLE x'

 i    class of x'          n_i      n'_i    (n_i - n'_i)^2 / n'_i
 1    -oo  to  -50   }
 2    -50  to  -30   }      13      17.0       0.94
 3    -30  to  -10          14      12.1       0.30
 4    -10  to  +10          25      13.7       9.32
 5    +10  to  +30          10      12.1       0.36
 6    +30  to  +oo          10      17.1       2.94

                        n = 72      72.0      13.86

where $n_i'$ denotes the hypothetical frequency of class i.

TABLE 4
$\chi^2$ TEST APPLIED TO VARIABLE y'

 i    class of y'          n_i      n'_i             (n_i - n'_i)^2 / n'_i
 1    -oo  to  -8    }
 2    -8   to  -4    }      ...     19.1                0.04
 3    -4   to   0           21      17.0                0.94
 4     0   to  +4           16      17.0                0.06
 5    +4   to  +8    }      10      11.1  }
 6    +8   to  +oo   }      ...      7.5  }  18.6       0.36

                        n = 72      72.2                1.40

where $n_i'$ denotes the hypothetical frequency of class i.



Let the level of significance $2\alpha = 0.02$. In this case $\alpha = 0.01$.
In the $\chi^2$ distribution tables for 4 degrees of freedom we find
$\chi_\alpha^2 = 13.277$. Since

$\chi_x^2 = 13.86 > \chi_\alpha^2 = 13.277$

we reject hypothesis H. We have rejected it on the basis of
the same sample on which hypothetical parameters of the
distribution were determined. Thus the reason for rejecting
the hypothesis is that the distribution in the sample is significantly
different from a normal distribution. If we do not want
to stop at comparing the marginal distributions of random
variables X' and Y' with a one-dimensional normal distribution
by the consistency test, we may check the consistency
of the joint distribution of variable (X', Y') with a two-dimensional
normal distribution. For this purpose we have to construct
a special contingency table and compare the frequencies
in particular panels of this table with the theoretical
frequencies. The latter are calculated by multiplying the
frequencies of the sample by the product of probabilities
corresponding to a given panel and taken from marginal
distributions. Knowing the frequencies in particular panels of the
contingency table and the theoretical frequencies, we can
check by the $\chi^2$ test whether these frequencies differ
significantly from one another.
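
The decision reached above for variable x' can be retraced numerically. The sketch below is an illustration under stated assumptions: the function name `chi_square` is hypothetical, and the frequencies are those of the $\chi^2$ table for x' with the two lowest classes combined, where the combined empirical frequency 13 and hypothetical frequency 17.0 are taken as the values implied by the totals n = 72 and 72.0. Because the individual terms are not rounded before summation, the statistic comes out a little above the printed 13.86.

```python
def chi_square(observed, expected):
    """Pearson's measure of divergence: sum of (n_i - n'_i)^2 / n'_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Classes of x': (-oo, -30), (-30, -10), (-10, +10), (+10, +30), (+30, +oo).
observed = [13, 14, 25, 10, 10]
expected = [17.0, 12.1, 13.7, 12.1, 17.1]

critical_value = 13.277   # from the tables, 4 degrees of freedom, alpha = 0.01
statistic = chi_square(observed, expected)
print(round(statistic, 2), statistic > critical_value)
```

Most of the divergence comes from the central class, where 25 observations stand against a hypothetical frequency of 13.7.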

The above method of testing hypothesis H requires many
cumbersome computations connected with the use of formula
(1). We do not refer here to computations connected with
the calculation of parameters $\bar x$, $\bar y$ and the sums:

$\sum (x - \bar x)^2$, $\sum (y - \bar y)^2$, $\sum (x - \bar x)(y - \bar y)$.

These computations are needed for all methods of testing
hypothesis H. We refer to computations involved in the process
of testing the hypothesis. The verification of the hypothesis
by the $\chi^2$ test in the way described above requires
computations which become more time-consuming as the size of
the sample increases. This is the drawback of this test. Tests
described in textbooks (e.g. see [28] 1 ) suffer from the same
drawback. Only a test for which the time needed for computations
does not greatly depend upon the size of the sample
is convenient and practical. This type of test is described
below. We shall call it the B test to distinguish it from the
test described above which we shall call the A test. The main
advantage of the B test is its simplicity: it requires very few
computations. The main disadvantage is its low "sensitivity".
1 A. Hald: Statistical Theory with Engineering Applications, New
York, 1952, pp. 602-604.

4.1.3. Testing the hypothesis H by dividing the plane into
quadrants (method B)

Let us denote by A the event that variable X assumes a
value greater than the average value $m_1$, and variable Y assumes
a value greater than the average value $m_2$. This means
that

$A = (X > m_1) \cap (Y > m_2)$, (1)

where $\cap$ means "and".
Let us further denote

$B = (X < m_1) \cap (Y < m_2)$, (2)

$C = (X > m_1) \cap (Y < m_2)$, (3)

$D = (X < m_1) \cap (Y > m_2)$. (4)

When $m_1 = m_2 = 0$ then A, B, C, D denote events in which
point (x, y) lies respectively in the 1st, 3rd, 4th and 2nd quadrant
of the coordinate system. It can be shown (see 1.2.9.,
formula (9)) that

$p_1 = P(A) = P(B) = \frac{1}{4} + \frac{1}{2\pi}\arcsin\rho$ (5)

and

$p_2 = P(C) = P(D) = \frac{1}{4} - \frac{1}{2\pi}\arcsin\rho$. (6)

Thus we know the probability that the
point (X, Y) is located in a particular quadrant of the plane.
The knowledge of these probabilities allows us to construct
a test to verify hypothesis H. This hypothesis may be verified
by many tests, but the $\chi^2$ test seems to be the most
convenient.

Knowing the probability of a random occurrence of a point 
in the individual quadrants of the plane, we can easily calcu- 
late the hypothetical numbers of points in these quadrants. 
We shall call these numbers the hypothetical frequencies of
the quadrants of the plane. The hypothetical number of points
in the individual quadrants will, of course, differ from the
empirical number of points. We shall call the empirical number
of points the empirical frequency of the quadrant of the
plane. Let us denote the hypothetical frequencies by $n'_i$ and
the empirical frequencies by $n_i$ (i = 1, 2, 3, 4).

The measure of divergence between the hypothetical and
empirical frequencies is expressed by the quantity

$\chi^2 = \sum_{i=1}^{4} \frac{(n_i - n'_i)^2}{n'_i}$,

which is a random variable and has the $\chi^2$ distribution with
three degrees of freedom. We reject hypothesis H when

$\chi^2 > \chi^2_\alpha$,

where $\chi^2_\alpha$ is a number dependent upon the significance
coefficient $\alpha$.

The construction of test B is based on the assumption that
the population parameters $m_1$, $m_2$ and $\rho$ are known. In practice
this happens very rarely. Therefore, when we do not
know the values of these parameters we have to substitute
for them the estimates from the sample. (A similar procedure
was also used in test A.)

Example 1. A sample of 72 items was taken from a two- 
dimensional population. Each item in the sample may be 
treated as a point on the plane. The coordinates of these 
points represent the two-dimensional random variable (X,Y). 
On the basis of statistical data obtained from the sample we 
want to check hypothesis H that the distribution of the two- 
dimensional population is normal. The statistical data are 
shown in Table 1 in 4.1.2. 

It follows from the calculations which we shall not quote
here 1 that

$\bar{x} = \frac{1}{n}\sum_i x_i = 190.4$,

$\bar{y} = \frac{1}{n}\sum_i y_i = 179.0$,

$r = 0.96$.

1 See the Appendix pp. 215-216.

Assuming that $m_1 = \bar{x}$, $m_2 = \bar{y}$ and $\rho = r$ we calculate the
frequencies of the actual occurrences of events A, B, C and D.
The occurrence of event A is equivalent to a point being located
in the 1st quadrant of the plane. It is assumed that the origin
of the coordinate system lies at point ($\bar{x}$, $\bar{y}$).

It is easy to check (using Table 1 in 4.1.2.) that event A
has occurred 24 times, event B 34 times, event C 7 times, and
event D also 7 times.

Assuming that $\rho = 0.96$ we determine $p_1$ and $p_2$ (see
formulae (5) and (6)). The values of $p_1$ and $p_2$ are functions of $\rho$.
Different values of parameters $p_1$ and $p_2$ correspond to
different values of $\rho$. They are shown in Table 2. Using this table
we find that $p_1 = 0.456$ and $p_2 = 0.044$ correspond to the
number $\rho = 0.96$.

Since we know $p_1$ and $p_2$ we can calculate the hypothetical
frequencies of points in particular quadrants of the plane
and we can check by the $\chi^2$ test the significance of the
deviations of the empirical frequencies from the hypothetical
frequencies.

The calculations are shown in Table 1.

TABLE 1
TEST APPLIED TO VERIFICATION OF HYPOTHESIS H BY
DIVIDING THE PLANE INTO QUADRANTS

No (i)  Event  $n_i$      $p_1, p_2$  $n'_i$  $n_i - n'_i$  $(n_i - n'_i)^2$  $(n_i - n'_i)^2 / n'_i$
1       A      24         0.456       33.0    -9.0          81.00             2.38
2       B      34         0.456       33.0     1.0           1.00             0.03
3       C       7 }       0.044 }
4       D       7 } 14    0.044 }      6.4     7.6          57.76             9.00
Total          72         1.000       72.4                                    $\chi^2 = 11.41$

Let us assume that the level of significance $\alpha = 0.02$. For
this level of significance with two degrees of freedom 1 the
corresponding value of $\chi^2_\alpha$ is 7.8. Therefore, hypothesis H
should be rejected since

$\chi^2 = 11.41 > \chi^2_\alpha = 7.8$.
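The arithmetic of test B can be retraced in a few lines. The sketch below is an illustration only: the helper name `chi_square` is hypothetical, the probabilities are the tabulated values for a correlation of 0.96, and the hypothetical frequencies are left unrounded, so the statistic comes out slightly different from the 11.41 printed in Table 1 while leading to the same decision.

```python
def chi_square(observed, expected):
    """Sum of (n_i - n'_i)^2 / n'_i over the classes."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have equal length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n = 72
p1, p2 = 0.456, 0.044            # tabulated values for a correlation of 0.96
observed = [24, 34, 7 + 7]       # events A, B, and C and D combined
expected = [n * p1, n * p1, n * 2 * p2]

chi2 = chi_square(observed, expected)
critical = 7.8                   # value quoted in the text (0.02 level, 2 d.f.)
print(f"chi2 = {chi2:.2f}, reject H: {chi2 > critical}")
```

Combining the two sparsely populated quadrants into one class keeps every hypothetical frequency reasonably large, at the price of one degree of freedom.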

We can see from the above example that test B is very 
simple to use. For this reason it has a variety of applications, 
particularly when the testing of hypothesis H is conducted 
on a large sample. 

In conclusion it might be worth while to say a few words 
about both tests. They can be used in two cases: 

1) when all the parameters of the distribution are known
and we are checking only its shape;

2) when we are testing a simple hypothesis that the population
is two-dimensional and normal, with given parameters.



1 Since, as a result of combining the last two classes in Table 1,
we now have only three instead of four classes.

TABLE 2
PARAMETERS $p_1$ AND $p_2$ AS FUNCTIONS OF $\rho$

$\rho$    $p_1 = \frac{1}{4} + \frac{1}{2\pi}\arcsin\rho$    $p_2 = \frac{1}{4} - \frac{1}{2\pi}\arcsin\rho$
0.98      0.468      0.032
0.96      0.456      0.044
0.94      0.444      0.056
0.92      0.436      0.064
0.90      0.428      0.072
0.88      0.422      0.078
0.86      0.415      0.085
0.84      0.409      0.091
0.82      0.403      0.097
0.80      0.398      0.102
0.75      0.386      0.114
0.70      0.373      0.127
0.65      0.363      0.137
0.60      0.352      0.148
0.55      0.342      0.158
0.50      0.333      0.167
0.45      0.325      0.175
0.40      0.315      0.185
0.35      0.307      0.193
0.30      0.298      0.202
0.25      0.290      0.210
0.20      0.282      0.218
0.15      0.275      0.225
0.10      0.266      0.234
0.05      0.258      0.242


The verification of hypothesis H by both tests jointly should
be carried out in two stages (as in two-stage sequence analysis).
In the first stage we use test B. If it provides grounds
for rejecting hypothesis H, the analysis is finished. If test B
does not enable us to reject hypothesis H, we move on to
the second stage, i.e. we apply test A.

4.2. Checking the hypothesis that the regression lines in
general population $\Omega$ are straight lines

4.2.1. General comments 

As we said in 1.2.5., the most difficult problem in the pro- 
cess of estimating regression parameters in a general popu- 
lation on the basis of a random sample taken from the popu- 
lation, is a proper choice of the approximation function. The 
amount of information that we have when we make such 
a choice is usually small : as a rule, all we have are the numer- 
ical data from the sample and the scatter diagram. From 
the distribution of the points on the scatter diagram we at- 
tempt to guess to which class the function appearing in the 
regression equation of the general population belongs. The 
word "guess" reflects very well the idea behind this procedure. 
When searching for this class we are groping in the dark. 
We cannot state anything; we can only guess. In this guess- 
ing the information supplied by the sample is useful and 
helpful : it allows us to formulate a statistical hypothesis that 
the function appearing in the regression equation belongs to 
a certain class of functions. The data from the sample enable 
us to test this hypothesis. 

In this section we shall discuss the methods of testing the 
hypothesis that the regression equation is a linear function, 
i.e. that 

$g(x) = \alpha x + \beta$.

This hypothesis we shall denote by the symbol $H_L$. In this
case



The verification of this hypothesis is of great practical
importance: as long as there are no grounds for rejecting hypothesis
$H_L$ we can consider that the regression lines in the general
population are straight lines. No other statistical hypothesis
will be better than hypothesis $H_L$; therefore we can abandon
them all and retain hypothesis $H_L$. The acceptance of some
other hypothesis, even if it is equivalent to hypothesis $H_L$
from the statistical point of view, results in serious
inconveniences connected with dealing with a regression curve
instead of a regression line and thus with the necessity of
determining the parameters of a curve instead of those of a line.
In the literature on the subject we can easily find (e.g. see
[16], p. 397) a description of methods of testing hypothesis
$H_L$ by a large sample. In practice, however, it is often
necessary to test this hypothesis on the basis of a small sample.
In 4.2.2. we propose a test which enables the verification of
hypothesis $H_L$ when the sample is small. In the following
item we describe, after Barkowski and Smirnow, a method
of testing hypothesis $H_L$ in a large sample.

4.2.2. Testing hypothesis $H_L$ in a small sample by a run
test

Hypothesis $H_L$ can be verified by a run test. We shall
describe this test briefly.

Let $x_1, x_2, \dots, x_n$ denote the realization of random variable
X determined on the basis of the elements of general population
$\Omega$, and let F(x) denote the distribution of variable X.
If $\omega$ is a sample composed of n items and taken from $\Omega$, then
$x_1, x_2, \dots, x_n$ can be treated as the values of items selected for
the sample (by "the values of items" we mean the actual
values of random variable X corresponding to particular
items of the sample). In repeated sampling, the values drawn
may be treated as the realization of a finite sequence of
variables $X_1, X_2, \dots, X_n$, corresponding to the numbers of the
items of sample $\omega$. If the sample is random, then

1) random variables $X_1, X_2, \dots, X_n$ are independent;
2) they have the same distribution.

In this case

$P(x_1, x_2, \dots, x_n) = P(x_1) \cdot P(x_2) \cdot \ldots \cdot P(x_n)$. (1)

The probability $P(x_1, x_2, \dots, x_n)$ is not dependent upon the
order of the variables. It follows that if sample $\omega$ is random
then the n! permutations formed from the numbers $x_1, x_2, \dots, x_n$
have the same probability of occurrence.

Suppose that we have a sequence of n items composed 
exclusively of elements A and B. Here is an example of such 
a sequence: 

A, A, A, B, A, B, B, A, B, A. (2)

We have here n = 10 items, among which there are $n_1 = 6$
elements A, and $n_2 = 4$ elements B. Each sub-sequence with
the largest possible number of items of the same kind is called
a run. The number K of items comprising a given run is called
the length of the run. Both the length of the run K and the
number of runs R are random variables. The distributions of
these variables are known. This enables us to test the
hypothesis that sample $\omega$ was taken at random. Below we show
a table taken from [16], p. 340. (A more detailed discussion
of the problems related to run theory can be found in
Chapter XIII of [20].) This table helps to verify the hypothesis by
a run test. Symbol $R_K$ appearing at the heading of the second
column denotes the total number of runs with lengths not
less than K; symbol $R_{1K}$ denotes the number of runs composed
of elements A, of length no less than K, and $R_{2K}$ denotes
the number of runs composed of elements B of length no less
than K. In the table are given the maximum values of the
number of observations n which satisfy one of the inequalities
shown at the head of the second, third or fourth column,
with probability less than 0.05. The method of using the run
test described above for the verification of hypothesis $H_L$
will be explained by an example.

TABLE 1
THE RUN TEST

Length       The greatest number of observations n for which the probability
of run K     of satisfying the inequalities shown below is less than 0.05

             $R_K \geq 1$    $R_{1K} \geq 1$ and $R_{2K} \geq 1$    $R_{1K} \geq 1$ (resp. $R_{2K} \geq 1$)

 5            10              16                                     10
 6            14              32                                     18
 7            22              64                                     28
 8            34             120                                     48
 9            54             230                                     80
10            86                                                    130
11           140                                                    230
12           230                                                    420


Example 1. Table 2 contains statistical data collected in
a Wroclaw brewery. The monthly production of beer in
hectolitres is given in column x, and the cost of labour in
zlotys in column y.

TABLE 2
BEER PRODUCTION AND LABOUR COSTS IN A WROCLAW
BREWERY

No    x         y              No    x         y
 1     1,225    2,712,505      12     8,488    3,418,286
 2     5,584    2,528,475      13    13,103    4,127,280
 3     6,520    3,121,262      14    14,472    4,136,483
 4    11,429    3,393,046      15    19,506    4,722,553
 5    13,707    3,754,896      16    20,017    4,662,901
 6    11,033    3,922,740      17    19,328    5,740,375
 7    12,891    4,171,386      18    19,713    5,301,217
 8    14,136    4,523,888      19    13,563    4,801,669
 9    13,303    4,475,384      20    10,408    4,554,512
10     9,465    3,851,908      21     8,805    4,090,115
11     7,277    3,400,815      22    10,683    4,093,417

There is a relationship between the cost of labour and
production. The regression line is a statistical expression
of this relationship. The scatter diagram shown in Graph 1
suggests that the regression line is a straight line. This
supposition may be treated as hypothesis $H_L$.



GRAPH 1. [Scatter diagram; horizontal axis: production of beer
(thousand hl/month).]

The consecutive stages of the verification of this hypothesis
are stated below.

1. Assuming that hypothesis $H_L$ is true we estimate (by
any method) the parameters of the regression line. In
our example the equation of this line is:

$y = 23.8 + 0.141x$,

where the coefficient b = 23.8 is measured in hundred
thousand zlotys per month and the coefficient a = 0.141
in hundred thousand zlotys per hectolitre.



1 Computation Table is shown in the Appendix, p. 217. 




2. We denote by A the event that a point lies above the
regression line, and by B the event that a point lies below
this line. Points lying directly on the line are not taken
into consideration. In practice this last event has no
chance of being realized since in cases of continuous
variables its probability is zero.

3. We arrange points according to increasing values of
the abscissa. On the assumption that hypothesis $H_L$
is true we can consider that the deviations of particular
points from the line $y = 23.8 + 0.141x$ are of random
character and do not depend upon the order of the
succession of the points. In our example the points are
arranged in the following order:

1, 2, 3, 11, 12, 21, 10, 20, 22, 6, 4, 7, 13, 9, 19, 5, 8, 14, 17, 15, 18, 16.

The figures denote the numbers of points in Table 2.
It may happen that in a sample there are points on both
sides of the regression line, but with the same abscissa.
In practice this is possible only when the abscissa X of
variable (X, Y) is a discrete variable. In this case, however,
the verification of hypothesis $H_L$ is no longer
needed and therefore we shall not consider this case.

4. Using the accepted arrangement of the points, we write
the sequence of the realized events A and B. In our
example the following sequence of events has been
obtained:

A, B, B, B, B, A, A, A, A, A, B, B, B, A, A, B, A, B, A, B, A, B.

5. We find the maximum length of run K and, using
Table 1, we analyse whether there are grounds for rejecting
hypothesis $H_L$ at the level of significance $\alpha = 0.05$. We
reject this hypothesis when the test shows that the
deviations of the points from the regression line are not
random, but that the points show a certain tendency in
their location above or below the line. In our example

K = 5.

Using Table 1 we can state that there are no grounds
for rejecting hypothesis $H_L$.
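Counting runs and finding the longest one, as required in stage 5, is mechanical. The following sketch is an illustration only, applied to sequence (2) of this item; the function names `runs` and `longest_run` are hypothetical, and the comparison with Table 1 is still left to the reader.

```python
from itertools import groupby

def runs(sequence):
    """Return the runs of a sequence as (element, length) pairs."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(sequence)]

def longest_run(sequence):
    """Length K of the longest run; 0 for an empty sequence."""
    return max((length for _, length in runs(sequence)), default=0)

# Sequence (2): A, A, A, B, A, B, B, A, B, A.
example = list("AAABABBABA")
print(runs(example))
print("number of runs R =", len(runs(example)))
print("longest run K =", longest_run(example))
```

The same two functions can be applied to the sequence of events A and B obtained from the points ordered by increasing abscissa.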

When the sample is small, the testing of hypothesis $H_L$ by
a run test is very convenient since, as a rule, there are no
computations involved other than those related to the
determination of regression parameters. Checking whether
a point lies below or above the regression line is done 
from the graph. To avoid difficulties the scale of the graph 
has to be properly selected. If a point on the graph lies exactly 
on the regression line (on Graph 1 points No. 6 and 11) 
we have to make appropriate calculations and check whether 
this point really lies on the regression line or whether it only 
appears to be located directly on the line because of the scale 
used in the graph and the drawing technique (upon which 
the thickness of the line and the size of the point depends). 
If random variable (X,Y) is continuous and if the sample is 
small and the graph sufficiently large, then it is seldom neces- 
sary to carry out computations to check whether a point lies 
on the line or close to it. 



4.2.3. Testing hypothesis $H_L$ in a large sample by Fisher's
test

The verification of hypothesis $H_L$ by a run test becomes
troublesome when the sample is very large. In such cases the
checking of each point, whether below or above the regression
line, takes too much time even if we do not make calculations
but use only the graph.

The verification of hypothesis $H_L$ in a large sample can be
done by Fisher's test F. It can be proved that the random
variable



has the distribution F with the number of the degrees of
freedom $k_1 = l - 2$ and $k_2 = n - l$. In formula (1) symbol $\hat{\eta}$
denotes an estimate of parameter $\eta$ on the basis of the sample (see
1.2.8.); n, as usual, denotes the size of the sample, and l the
number of values that variable X assumes in the contingency
table (in other words l is the number of rows in this table).
A detailed description as to how test F should be used to
verify hypothesis $H_L$, together with a numerical example, can
be found in [16], p. 397.

4.3. An analysis of the significance of regression parameters 

The regression coefficient in a sample is a random variable
with its own distribution, its expected value and variance.
Bartlett has shown (see [4]) that if the distribution of the
random variable (X, Y) is normal 1, then the variable



has Student's distribution with $n - 2$ degrees of freedom 2.



1 We shall mention in passing that the Bartlett test can only be
used if the distribution of variables (X, Y) is normal. That is one of the
reasons why in 4.1. we gave two tests for verifying the hypothesis that the
distribution of the variable (X, Y) is normal.

2 Tables for Student's distribution are given at the end of the book.

In formula (1)

$a_{21} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$.

By elementary transformations we can show that

$t' = \frac{(a_{21} - \alpha_{21})\, S_1 \sqrt{n-2}}{S_{21}}$, (2)


where 



The knowledge of the distribution of variable t' enables us to
determine the confidence region which will cover the unknown
value of parameter $\alpha_{21}$ with probability $\alpha$. Knowing the
distribution of variable t' we can write that

$P\left\{ a_{21} - \frac{t' S_{21}}{S_1\sqrt{n-2}} < \alpha_{21} < a_{21} + \frac{t' S_{21}}{S_1\sqrt{n-2}} \right\} = \alpha$. (3)


We can also prove that the variable 



' 



has Student's distribution with $n - 2$ degrees of freedom.
Hence



l/ii 2 2 a i/n 2 
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In formulae (4) and (5) the parameter b. M ya^x. It can 
also be proved that the variable 



J. 



l/ii 2 



X) 2 



(6) 



SJ 



has Student's distribution with $n - 2$ degrees of freedom.
Therefore



y 



1 + 



5? 



-2 



-2 



(7) 



We shall show the formal relationship between formulae (2) 
and (4), and formula (7). Let us consider the regression line 
equation in the sample 



where 






Let us assume that 



ii -2 
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?2 
>21 



S?(fi-2) (it -2) 



Hence 



Therefore, on the basis of (8) and (9) 

2i 



b on #>r 



(9) 
(10) 

(11) 
(12) 



All comments concerning the regression parameters of Y on X 
also apply to the regression parameters of X on Y. 

Example 1. The scatter diagram on Graph 1 shows the 
relationship between the average monthly expenditures for 
consumption and the average monthly income of twenty 
four-member families drawn from among four-member 
families included in family budget studies in Lower Silesia. 
The statistical data on which the diagram is based come from 

Table 2 in 3.2.2. 

GRAPH 1.

The two straight lines shown on Graph 1 are regression
lines determined by the classical method and the two-point
method. The positions of the two lines do not differ much.
The equation of the regression line determined by the method
of least squares is

$y = 47 + 0.31x$,

and the equation of the regression line determined by the
two-point method is

$y = 55 + 0.28x$ 1.

The equations of the two continuous curves shown on
Graph 1 are expressed by the formula

$\hat{y} = y \pm t\,\frac{S_{21}}{\sqrt{n-2}}\sqrt{1 + \frac{(x - \bar{x})^2}{S_1^2}}$. (13)

In our example

$y = 47 + 0.31x$,
$S_1^2 = 4{,}269$,
$S_{21} = 20.69$,
$\bar{x} = 281$,
$n = 20$.

Thus the equation of the continuous curve located above the
regression lines and corresponding to the value t = 1 will
assume the following form:

$\hat{y} = 47 + 0.31x + \frac{20.69}{\sqrt{18}}\sqrt{1 + \frac{(x - 281)^2}{4{,}269}}$,

and the equation of the continuous curve located below the
regression lines and corresponding to the value t = 1 is
expressed by the formula

$\hat{y} = 47 + 0.31x - \frac{20.69}{\sqrt{18}}\sqrt{1 + \frac{(x - 281)^2}{4{,}269}}$.

1 The computation table is shown in the Appendix p. 218.

Curves (13) determine the confidence region which will cover
the regression line in the general population $\Omega$ with probability $\alpha$.
For $\alpha = 0.98$ with 18 degrees of freedom, we find in Student's
distribution table:

t = 2.55.

The two interrupted curves on Graph 1 are curves (13) for
t = 2.55 1.
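The curves bounding the confidence region can be tabulated numerically. The sketch below is an illustration under a stated assumption: the half-width of the band at abscissa x is taken to be $t\,S_{21}\sqrt{1 + (x - \bar{x})^2 / S_1^2}\,/\sqrt{n - 2}$, the form suggested by the numerical example above, with $S_{21} = 20.69$, $S_1^2 = 4{,}269$, $\bar{x} = 281$ and n = 20. The function names are hypothetical.

```python
import math

def band_half_width(x, t, s21=20.69, s1_sq=4269.0, x_mean=281.0, n=20):
    """Assumed half-width of the confidence band around the line at x."""
    return t * s21 / math.sqrt(n - 2) * math.sqrt(1.0 + (x - x_mean) ** 2 / s1_sq)

def regression_line(x):
    """Least-squares line quoted in the example: y = 47 + 0.31x."""
    return 47.0 + 0.31 * x

for x in (150.0, 281.0, 400.0):
    centre = regression_line(x)
    half = band_half_width(x, t=2.55)
    print(f"x={x:5.0f}  lower={centre - half:7.1f}  upper={centre + half:7.1f}")
```

The band is narrowest at the mean of x and widens on either side of it, which is why the bounding curves bend away from the regression line towards the ends of the graph.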

Using the confidence region we can decide whether the
position of the regression line obtained by the method of
least squares differs significantly from the position of the
regression line obtained by the two-point method. Let us denote
by $H_r$ the statistical hypothesis that there is only a random
difference between the position of the regression line determined
by the method of least squares on the basis of statistical
data from sample $\omega$ and the position of the regression
line obtained on the basis of the same data by the two-point
method. To test hypothesis $H_r$ we have to select a number $\alpha$
and accordingly draw two lines determining the confidence
region. We reject hypothesis $H_r$ when the line determined
by the two-point method intersects one of the curves determining
the confidence region. We can see from the graph
that in our example there are no grounds for rejecting
hypothesis $H_r$ since the line determined by the two-point method
does not intersect either of the two broken curves drawn on
Graph 1. These curves correspond to the confidence coefficient
$\alpha = 0.98$.

1 See the Appendix p. 219.

We know that if a regression line in a general population
is a straight line the regression parameters in the population
calculated by the two methods will be the same. This means
that the estimates of regression parameters obtained from the
sample by the method of least squares and by the two-point
method should not show any significant difference if the
assumption is true that the regression line of the population is
a straight line. It follows that the verification of hypothesis
$H_r$ is equivalent to the verification of hypothesis $H_L$. In our
example hypothesis $H_r$ has not been rejected. It is easy to
find out (using the run test described in 4.2.2.) that there are
also no grounds for rejecting hypothesis $H_L$.

The verification of hypothesis $H_L$ by the determination of
the confidence region for the regression line is too cumbersome.
Let us remember, however, that the regression lines
of the sample determined by both methods go through point
($\bar{x}$, $\bar{y}$). Therefore, instead of checking whether the positions
of the two regression lines differ significantly from one another,
it is sufficient to find out whether their slopes differ
significantly. To do this we proceed as follows.

We select number $\alpha$ and determine the confidence region
for $a_{21\,\mathrm{class}}$ according to formula (3). Then we check whether
$a_{21\,\mathrm{point}}$ 1 lies within this region. We reject hypothesis $H_L$
when $a_{21\,\mathrm{point}}$ lies outside the confidence region, i.e. is in
the critical area.

1 Let us remember that $a_{21\,\mathrm{class}}$ is the regression coefficient obtained
by the classical method, and $a_{21\,\mathrm{point}}$ denotes the regression coefficient
obtained by the two-point method (see List of Symbols at the end of the
book).

In our example we have:

$a_{21\,\mathrm{class}} = 0.31$,
$a_{21\,\mathrm{point}} = 0.28$,
$S_{21} = 20.69$,
$S_1^2 = 4{,}269$, hence $S_1 = 65$,
$\alpha = 0.98$,
$t' = 2.55$,
$n = 20$.

Therefore

$0.31 - \frac{20.69 \cdot 2.55}{65\sqrt{18}} < \alpha_{21} < 0.31 + \frac{20.69 \cdot 2.55}{65\sqrt{18}}$

or

$0.12 < \alpha_{21} < 0.50$.

Since

$0.12 < a_{21\,\mathrm{point}} = 0.28 < 0.50$,

the slopes of the two lines are not significantly different so
that there are no grounds for rejecting hypothesis $H_L$.
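The interval for the slope can be checked numerically. The following sketch is an illustration only: it evaluates the half-width $t' S_{21} / (S_1\sqrt{n - 2})$ with the figures of the example, and the function name `slope_confidence_interval` is a hypothetical choice.

```python
import math

def slope_confidence_interval(a21, s21, s1, n, t):
    """Interval a21 -/+ t * s21 / (s1 * sqrt(n - 2)) for the slope."""
    half_width = t * s21 / (s1 * math.sqrt(n - 2))
    return a21 - half_width, a21 + half_width

low, high = slope_confidence_interval(a21=0.31, s21=20.69, s1=65.0, n=20, t=2.55)
a21_point = 0.28  # slope obtained by the two-point method
print(f"{low:.2f} < alpha_21 < {high:.2f}")
print("two-point slope inside the interval:", low < a21_point < high)
```

Because both sample lines pass through the point of means, comparing the two slopes in this way replaces the comparison of the whole lines.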

The application of the two-point method to the verification
of hypothesis $H_L$ is very convenient since there are few extra
computations involved. If we have made all calculations
required to determine parameter $a_{21\,\mathrm{class}}$ the determination
of parameter $a_{21\,\mathrm{point}}$ is simple because most of the calculations
needed for the classical method can be used in the two-point
method.

We can see from this example that the two-point method
is not only useful for estimating the regression parameters,
but can also be applied to the verification of hypothesis $H_L$.

Example 2. Table 1 contains data from monthly reports
on the production of beer in hundreds of hectolitres (x) and
the cost of electric power in thousands of zlotys (y). The
figures follow the chronological order.

TABLE 1
BEER PRODUCTION AND POWER COSTS IN A WROCLAW
BREWERY

No    x      y        No    x      y
 1     12    292      11     73    459
 2     56    308      12     62    446
 3     65    388      13    195    414
 4    114    388      14    200    463
 5    137    517      15    193    448
 6    110    545      16    197    449
 7    129    536      17    136    435
 8    141    536      18    104    373
 9    133    561      19     88    361
10     95    512      20    107    366


The data from this table were used for Graph 2. 

GRAPH 2.

There are three straight lines on this graph. The line with
the equation $y = 359 + 0.69x$ was determined from full
statistical material, i.e. from the data pertaining to all
points. We shall call it line I. At first glance this line
does not arouse any doubts. Let us note, however, that
8 consecutive points (marked on the graph by crosses) are
located below the regression line. These points correspond
to the data for the last eight months shown in Table 1. An
event consisting of 8 consecutive points from among 20 points
distributed at random on both sides of the regression line
and located on one side of the line, has a small probability
of occurrence. A run test indicates that the line with equation
$y = 359 + 0.69x$ cannot be considered as a regression line.
For this reason the data contained in Table 1 should be
divided into two parts 1. In the first part the first 12 observations
are included. The equation of the regression line based on
these data is $y = 277 + 1.93x$. Let us call it line II. The second
part comprises the remaining 8 observations. The equation
of the corresponding regression line is $y = 300 + 0.75x$. We
shall call it line III. Using formula (6) we can discover that
the position of line II differs significantly from that of line III.
This much information has been provided by a formal analysis
of the data shown in Table 1. Let us now comment upon
the economic aspect of the problem. The efforts of the factory
personnel to reduce the cost of production were effective:
the cost of electric power was substantially lowered. Success
came imperceptibly. The daily efforts of each worker to save
electric power finally had a visible cumulative effect over
a period of several months. The nature of the relationship
between the cost of power and production 2 changed significantly.
The variable part of the cost of electric power was
lowered. In the first 12 months this cost amounted to
19.3 zlotys/hl.; in the next 8 months it only amounted to
7.5 zlotys/hl. This is a very important achievement by the
workers of the enterprise.

1 See the Appendix computation table on pp. 220-222.

2 Let us note in passing that we could not learn about this
relationship without the assistance of correlation analysis.

Let us here make the following comment: an analysis as to
whether the position of the regression line in population $\Omega_1$
differs significantly from the position of the regression line
in population $\Omega_2$ is possible only when the hypothesis is true
that the standard errors of estimates (see 1.2.7.) in $\Omega_1$ and
$\Omega_2$ are the same. This hypothesis can be verified by Fisher's
F test, or (especially when the sample is small) by Sadowski's
test [48].
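The split of the power-cost data into two periods can be retraced with a short least-squares computation. The sketch below is an illustration only: the helper `least_squares` is a hypothetical name, and the data are those of Table 1 of this example. The slopes come out close to the printed 1.93 and 0.75; the intercepts may differ by a unit or two from the printed ones, presumably because of rounding in the original computation tables.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b + a*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Table 1: production (hundreds of hl) and power cost (thousands of zlotys).
x = [12, 56, 65, 114, 137, 110, 129, 141, 133, 95,
     73, 62, 195, 200, 193, 197, 136, 104, 88, 107]
y = [292, 308, 388, 388, 517, 545, 536, 536, 561, 512,
     459, 446, 414, 463, 448, 449, 435, 373, 361, 366]

b2, a2 = least_squares(x[:12], y[:12])   # first 12 months (line II)
b3, a3 = least_squares(x[12:], y[12:])   # last 8 months (line III)
print(f"line II:  y = {b2:.0f} + {a2:.2f}x")
print(f"line III: y = {b3:.0f} + {a3:.2f}x")

# Position of the last 8 points relative to line I, y = 359 + 0.69x.
below_line_I = [yi < 359 + 0.69 * xi for xi, yi in zip(x[12:], y[12:])]
print("all of the last 8 points below line I:", all(below_line_I))
```

The last check reproduces the run of eight consecutive points below line I that motivated the split in the first place.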

4.4. An analysis of the significance of the correlation 
coefficient 

If the distribution of the random variable (X, Y) defined on
the elements of the general population $\Omega$ is normal and if the
correlation coefficient $\rho$ of the population is close to zero,
then, for a sufficiently large n, the distribution of the
correlation coefficient r in sample $\omega$ drawn from $\Omega$ does not
differ much from a normal distribution with parameters

$\rho$ and $\frac{1 - \rho^2}{\sqrt{n-1}}$.

Therefore, for $\rho$ close to zero and for a large n, the
distribution of the random variable

$\frac{r - \rho}{1 - \rho^2}\sqrt{n-1}$ (1)

is close to normal N(0, 1) (see [28] 1).
It can also be proved that if $\rho = 0$ then the random
variable

$t = \frac{r}{\sqrt{1 - r^2}}\sqrt{n-2}$ (2)

has Student's distribution with $n - 2$ degrees of freedom.
Thus we can easily test the hypothesis that $\rho = 0$. We shall
denote this hypothesis by $H_\rho$. In this case



1 A. Hald: Statistical Theory with Engineering Applications, New
York, 1952, pTfiOS.

In practice it is often necessary to test hypothesis $H_\rho$;
therefore we shall illustrate below the procedure involved in
checking this hypothesis.

Fisher, who studied the distribution of the correlation
coefficient, obtained interesting results with wide practical
implications, by introducing the variable

$z = \frac{1}{2}\ln\frac{1+r}{1-r} = 1.1513\,\log\frac{1+r}{1-r}$, (3)

where $-1 < r < 1$; $-\infty < z < \infty$.

In formula (3) ln denotes the natural logarithm, and log stands
for the logarithm to the base 10.

As n increases the distribution of z converges rapidly to
the normal distribution. Since

$E(z) \approx \frac{1}{2}\ln\frac{1+\rho}{1-\rho} + \frac{\rho}{2(n-1)}$ (4)

and

$V(z) \approx \frac{1}{n-3}$, (5)

then the distribution of the random variable

$[z - E(z)]\sqrt{n-3}$ (6)

is close to the normal distribution N(0, 1). In this case

$P\left\{ z - \frac{t_\alpha}{\sqrt{n-3}} < E(z) < z + \frac{t_\alpha}{\sqrt{n-3}} \right\} = \alpha$, (7)

where $0 < \alpha < 1$.

Knowing the confidence region for E(z) we can easily write
the confidence region for $E(r) = \rho$. Let us denote



In this case 



1 _<> 2(n - 1) 

If we omit the second component of the sum between the
inequality signs (since it is small in comparison with
$\frac{1}{2}\ln\frac{1+\rho}{1-\rho}$), then after elementary transformation we obtain

1 



< e < 



> + 1 



a. 



(8) 



The double inequality within the brackets determines the
confidence region for $\rho$.
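The passage from an interval for E(z) to an interval for the population correlation coefficient can be sketched numerically. The example below is an illustration under stated assumptions, not the author's computation: it uses the normal approximation for z with variance 1/(n - 3), drops the small second component as described above, and inverts Fisher's transformation with the hyperbolic tangent. The values r = 0.19 and n = 30 and the 0.95 confidence coefficient are chosen only for illustration, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_z(r):
    """Fisher's transformation z = (1/2) ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def rho_interval(r, n, confidence):
    """Approximate interval for rho, ignoring the small rho / (2(n - 1)) term."""
    u = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided normal quantile
    half_width = u / math.sqrt(n - 3)
    z = fisher_z(r)
    # tanh inverts the transformation: tanh(z) = (e^{2z} - 1) / (e^{2z} + 1)
    return math.tanh(z - half_width), math.tanh(z + half_width)

r, n = 0.19, 30   # illustrative values chosen for this sketch
low, high = rho_interval(r, n, confidence=0.95)
print(f"z = {fisher_z(r):.4f}")
print(f"{low:.2f} < rho < {high:.2f}")
```

An interval that contains zero is consistent with a test that finds no grounds for rejecting the hypothesis of zero correlation.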

Let us now illustrate by an example the verification of
hypothesis $H_\rho$.

Example 1. Table 1 contains the data on the average 
monthly income (x) and the average monthly expenditures 
on drink (y) for 30 four-member families drawn from among 
four-member families included in family budget studies in 
Lower Silesia. 



TABLE 1

MONTHLY INCOMES AND MONTHLY EXPENDITURES ON DRINK
FOR 30 FOUR-MEMBER FAMILIES

No      x      y  |  No      x      y  |  No      x      y
 1    2,568    42 |  11    2,563    15 |  21    9,263    47
 2    2,538    42 |  12    4,058    71 |  22    3,639    76
 3    2,491    46 |  13    2,577    71 |  23    4,861   231
 4    3,442    34 |  14    2,129    39 |  24    4,918    78
 5    2,462    29 |  15    3,450    33 |  25    2,607    82
 6    4,111    36 |  16    2,734    68 |  26    3,904   119
 7    2,170   134 |  17    2,515   118 |  27    3,594    71
 8    2,191    74 |  18    2,251    27 |  28    3,884   134
 9    3,586   153 |  19    2,544    17 |  29    3,110   146
10    3,777   210 |  20    1,940    32 |  30    5,388    54

The scatter diagram shown on Graph 1 is based on the data
from this table.

GRAPH 1. 
The distribution of the points on the scatter diagram suggests
that the relationship between the expenditures on drink
and the size of income is very weak. In economic language
this means that the want called "drink" is fully satisfied in
Poland.

This rather regrettable fact is generally known and statistics
serve only to confirm it. It follows from the calculations
that the correlation coefficient between the expenditures on
drink and income is 0.19. Let us formulate the hypothesis
$H_0$ that the correlation coefficient $\rho = 0$. To test this
hypothesis let us calculate the value of t according to formula (2).
We have

$$t = \frac{0.19}{\sqrt{1 - (0.19)^2}}\sqrt{28} = 1.02.$$

In Student's distribution tables for $\alpha = 0.05$ with 28 degrees
of freedom, we find that $t_\alpha = 2.045$. Since

$$t = 1.02 < t_\alpha = 2.045,$$

there are no grounds for rejecting the hypothesis $H_0$ that the
correlation coefficient between the expenditures on drink and
income equals zero.
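
The test can be checked with a few lines of code. This is a minimal sketch that assumes the usual Student statistic for a correlation coefficient, $t = r\sqrt{n-2}/\sqrt{1-r^2}$; the critical value 2.045 is simply the tabulated figure quoted in the text, and the function name is hypothetical.

```python
import math


def correlation_t_statistic(r, n):
    """Student's t for testing rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


t = correlation_t_statistic(0.19, 30)
critical = 2.045  # tabulated value for alpha = 0.05 and 28 degrees of freedom, as quoted in the text
reject = abs(t) > critical
print(round(t, 3), reject)
```

Evaluated this way the statistic stays well below the critical value, so the hypothesis of zero correlation is not rejected.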




5. THE TRANSFORMATION OF CURVILINEAR INTO 
LINEAR REGRESSION 

It may happen that the points on the scatter diagram are
so distributed that we should reject the hypothesis $H_L$ (see
4.2.) stating that the correlation is linear. We can then proceed
in one of the following two ways: either by determining the
parameters of the segments of straight lines forming a broken line
or by selecting a family of curves and determining the
parameters of one of the curves belonging to this family.

We shall not discuss here the method of determining the 
broken line since this can be reduced to the determination of 
the parameters of a linear regression, but we shall deal with 
certain cases of curvilinear regression. Suppose that the sto- 
chastic relationship between variables x and y can be well 
described by a function which is graphically presented as a 
curve. Let us consider a family of such functions; by appro- 
priate transformations they can be reduced to the linear form: 

$$z = Av + \gamma, \tag{1}$$

where $z = z(y)$, $v = v(x)$ and $A$ and $\gamma$ are constants. Table 1
shows the most commonly used transformations and the 
functions that can be obtained by them. In equation (1) there 
are two parameters. However, it may happen that for the 
approximating curve to describe properly the distribution of 
the points on the diagram, the equation of the curve must 
be a function with more than two parameters. Then, of course, 
linear transformation cannot be used. In economic research, 
however, it is seldom necessary to use the type of function 
having more than two parameters. 
The functions reviewed in Table 1 have many applications
in econometrics. For instance, in studies on demand we deal 
with hyperbolic relationships (see 2.2.3.). A parabola, an ex- 
ponential curve and a logarithmic curve are used in the theory 
of wants (see 2.2.1.). In demographic research exponential 
curves are most frequently used (this will be discussed in 
Example 1). A hyperbola is used in the theory of costs (see 
2.2.4.). In the analysis of time series various functions are
applied; after the linear function, the most frequently used are
trigonometric functions and the function that is geometrically
represented by a "logistic curve".

Winkler's work [61] provides a review of the more impor- 
tant functional relationships in economics. For this reason 
the book is well worth reading. In studying techno-economic 
relationships (see 2.2.6.) and in determining distribution cur- 
ves (see 2.2.2., Graph 1) various functions are used. However, 
in these cases, too, the functions most frequently employed 
are those shown in Table 1. 

Linear transformation is used mainly because it enables 
us to satisfy the conditions required by the Markoff Theorem. 

If the parameters of any approximating function obtained
by formula (1) are determined by the method of least squares,
then we know from the Markoff Theorem that the estimates of
these parameters are consistent, unbiased and most efficient.

Linear transformation is also useful because it considerably 
simplifies calculations. This is of great practical importance 
and, therefore, we shall discuss this aspect in greater detail. 
As we know, in order to determine the values of constant
parameters in the equation of the approximating function
y = g(x) by the method of least squares, the partial derivatives
have to be calculated for the expression

$$S = \sum_i [y_i - g(x_i)]^2.$$

These are to be equated to zero and the set of normal equations
so obtained then solved.

GRAPH 6.

GRAPH 7.

It is usually difficult and sometimes impossible to solve 
this set by algebraic methods. The application of approxima- 
tion methods further complicates the computation procedure 
so that it is of little value in practice. 

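To illustrate, let us consider the exponential function

$$y = b a^{x}.$$

We want to determine the parameters of this function and,
therefore, we want

$$S = \sum_i [y_i - b a^{x_i}]^2$$

to be a minimum. Calculate the partial derivatives

$$\frac{\partial S}{\partial b} = -2\sum_i a^{x_i}[y_i - b a^{x_i}], \qquad \frac{\partial S}{\partial a} = -2\sum_i b x_i a^{x_i - 1}[y_i - b a^{x_i}].$$

Equate them to zero:

$$\sum_i a^{x_i}[y_i - b a^{x_i}] = 0,$$

$$\sum_i x_i a^{x_i - 1}[y_i - b a^{x_i}] = 0.$$
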



As we can see, there are difficulties in solving this set of normal
equations (containing only two unknowns). These difficulties
disappear when we apply linear transformation. Thus we
calculate $z_i = \ln y_i$ and minimize the expression

$$S' = \sum_i [z_i - (A x_i + \gamma)]^2,$$

where

$$A = \ln a, \qquad \gamma = \ln b.$$

From the solution of the set of normal equations we obtain
the known formulae

$$A = \frac{\sum_i (x_i - \bar{x})(z_i - \bar{z})}{\sum_i (x_i - \bar{x})^2}, \qquad \gamma = \bar{z} - A\bar{x}.$$

Having the values of $A$ and $\gamma$ we can calculate parameters a
and b without difficulty.
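
The linearization just described can be sketched in code. The sketch below is illustrative only: the function name and the test data are hypothetical, and it fits the line to ln y by ordinary least squares and then returns to a and b by exponentiation. It requires every y to be positive.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = b * a**x by least squares applied to z = ln y."""
    zs = [math.log(y) for y in ys]  # linear transformation z = ln y
    n = len(xs)
    x_bar = sum(xs) / n
    z_bar = sum(zs) / n
    sxz = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxz / sxx                  # estimate of ln a
    intercept = z_bar - slope * x_bar  # estimate of ln b
    return math.exp(slope), math.exp(intercept)


# Hypothetical data lying exactly on y = 3 * 2**x
xs = [0, 1, 2, 3, 4]
ys = [3.0 * 2.0 ** x for x in xs]
a, b = fit_exponential(xs, ys)
print(a, b)
```

Note that the transformed fit minimizes the squared deviations of ln y, not of y itself, so for noisy data it generally differs from a direct least-squares fit of the exponential curve.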

It can be seen from the above comments that there are good
reasons why linear transformation should be used. We have
to remember, however, that although the computation is thus
considerably facilitated, it is still fairly laborious to determine
the parameters of the regression line by the method of least
squares. This is because the calculations on the numbers $x_i$
and $y_i$ required for linear transformation yield three-, four- or
five-digit numbers which cannot be rounded off radically without
endangering the accuracy of these calculations. Under these
circumstances the two-point method is very useful because
it enables us to determine the regression parameters without
cumbersome computations. This method is used in the
numerical examples given below.

Example 1. 

TABLE 2
GROWTH OF POPULATION IN SWEDEN 1750-1935

Years   Population      t       x     z = ln x
1750    1,780,678       1      178     5.18178
1760    1,925,248       2      193     5.25750
1770    2,042,574       3      204     5.31812
1780    2,118,281       4      212     5.35659
1790    2,187,732       5      219     5.38907
1800    2,347,303       6      235     5.45959
1810    2,396,351       7      240     5.48064
1820    2,584,690       8      258     5.55296
1830    2,888,082       9      289     5.66643
1840    3,138,887      10      314     5.74939
                    Σ = 55          Σ = 54.41207
1850    3,482,541      11      348     5.85220
1860    3,859,728      12      386     5.95586
1870    4,168,525      13      417     6.03309
1880    4,565,668      14      457     6.12468
1890    4,784,981      15      478     6.16961
1900    5,136,441      16      514     6.24222
1910    5,522,403      17      552     6.31355
1920    5,904,489      18      590     6.38012
1930    6,142,191      19      614     6.41999
1935    6,250,506      19.5    625     6.43775
                    Σ = 154.5       Σ = 61.92907
                    Σ = 209.5       Σ = 116.34114

In Table 2 the population of Sweden is shown for 1750-1935
(see [61], p. 158). In column t the consecutive numbers
of years are given, and in column x the population figures
rounded off to three significant digits.

Parameters $A$ and $\gamma$ are calculated by formulae (7) and (8),
3.3.1. We have

$$\bar{z}_{(1)} = \frac{54.41207}{10} = 5.441207, \qquad \bar{z}_{(2)} = \frac{61.92907}{10} = 6.192907,$$

$$\bar{t}_{(1)} = \frac{55}{10} = 5.5, \qquad \bar{t}_{(2)} = \frac{154.5}{10} = 15.45,$$

$$\bar{z} = \frac{116.34114}{20} = 5.81706, \qquad \bar{t} = \frac{209.5}{20} = 10.475,$$

$$A = \ln a = \frac{6.192907 - 5.441207}{15.45 - 5.5} = 0.075548,$$

$$\gamma = \ln b = 5.81706 - 0.075548 \cdot 10.475 = 5.02569, \qquad b = e^{5.02569} \approx 152.$$

A straight line with the equation z = 5.02569 + 0.075548t is
shown on Graph 8. As we can see, it fits the distribution of the
points on the graph very well. These points show a clear linear
tendency. Note that the points have been plotted in the
coordinate system of the transformed variables (v, z) and not in
the system of the original variables (t, x), because Graph 8 shows
the distribution of points after linear transformation. This
transformation was needed for the determination of parameters
a and b. After these parameters are found we can return to the
original distribution, i.e. the distribution before transformation.
This distribution, together with the regression line fitted to it,
is shown on Graph 9. As can be seen, an exponential curve
represents the dynamics of population growth in Sweden well.
A major deviation can be noticed only in the first and last years
of the period studied. It can be seen from the graph that the
parameters of the curve have been properly chosen and so it can
be said that the dependence of the population growth in Sweden
upon time, in the period 1750-1935, can be approximated
by the exponential function with the equation

$$x = 152\,e^{0.075548t}.$$

The application of linear transformation enabled us to determine
parameters a and b without difficulty; the known formulae for
determining the parameters of the straight line were used.
Substantial simplifications in the computation were also achieved
by the application of the two-point method to the determination
of the regression parameters.
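
The two-point computation of this example can be reproduced as follows. This is a sketch of how the worked numbers appear to be obtained, not a statement of formulae (7) and (8) of 3.3.1, which are not reproduced here: it assumes that the observations are split at the mean of t, that the slope is the ratio of the differences of the group means, and that the line passes through the overall means. The z values are the logarithms printed in Table 2; the function name is hypothetical.

```python
t_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
            11, 12, 13, 14, 15, 16, 17, 18, 19, 19.5]
z_values = [5.18178, 5.25750, 5.31812, 5.35659, 5.38907,
            5.45959, 5.48064, 5.55296, 5.66643, 5.74939,
            5.85220, 5.95586, 6.03309, 6.12468, 6.16961,
            6.24222, 6.31355, 6.38012, 6.41999, 6.43775]


def mean(values):
    return sum(values) / len(values)


def two_point_line(ts, zs):
    """Slope and intercept of z = slope * t + intercept by the two-point method (assumed form)."""
    t_bar, z_bar = mean(ts), mean(zs)
    # Split the observations into those at or below the mean of t and those above it
    lower = [(t, z) for t, z in zip(ts, zs) if t <= t_bar]
    upper = [(t, z) for t, z in zip(ts, zs) if t > t_bar]
    t1, z1 = mean([t for t, _ in lower]), mean([z for _, z in lower])
    t2, z2 = mean([t for t, _ in upper]), mean([z for _, z in upper])
    slope = (z2 - z1) / (t2 - t1)
    intercept = z_bar - slope * t_bar
    return slope, intercept


slope, intercept = two_point_line(t_values, z_values)
print(round(slope, 6), round(intercept, 5))
```

Here the slope estimates ln a and the intercept estimates ln b; exponentiating both returns the parameters of the exponential curve.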

GRAPH 8. 
Example 2. W. Stys has noticed during his demographic
studies that there is an interesting relationship between
the number of children in peasant families and the
size of farms possessed by these families. This relationship
can be considered significant since it appears very clearly
on the basis of abundant statistical data (the sample covers
8,505 families). The relationship noticed by Stys can very
well be described by a power function. The equation of the
regression line calculated by the method of least squares and
with the application of a linear transformation (see [55]) is



Since in linear transformation it is necessary to take
logarithms which provide five-digit numbers and since
the frequency of the distribution is expressed by numbers
with three or four significant digits, the computations
connected with the determination of the parameters of the
regression curve by the method of least squares are cumbersome
and time-consuming. Incomparably simpler calculations
are required for the two-point method. To illustrate
we shall compute the parameters of the power curve by this
method. The data are taken from Table 3.



TABLE 3

NUMBER OF CHILDREN IN, AND FARMS OWNED BY,
PEASANT FAMILIES

 i      x       y      n_i    v = log x   z = log y      n_i v          n_i z
 1     0.25    4.88     317    -0.6021     0.6884      -190.8657       218.2228
 2     0.75    5.20     658    -0.1249     0.7160       -82.1842       471.1280
 3     1.50    5.79   1,509     0.1761     0.7627       265.7349     1,150.9143
 4     2.50    6.16   1,584     0.3979     0.7896       630.2736     1,250.7264
                  Σ = 4,068                          Σ = 622.9586   Σ = 3,090.9915
 5     3.50    6.57     836     0.5441     0.8176       454.8676       683.5136
 6     4.50    6.83     961     0.6532     0.8344       627.7252       801.8584
 7     6.00    7.00   1,319     0.7782     0.8451     1,026.4458     1,114.6869
 8     8.50    7.67     620     0.9294     0.8848       576.2280       548.5760
 9    12.50    7.90     509     1.0969     0.8976       558.3221       456.8784
10    17.50    8.59     139     1.2430     0.9340       172.7770       129.8260
11    25.00    8.66      44     1.3979     0.9375        61.5076        41.2500
12    40.00    9.11       9     1.6021     0.9595        14.4189         8.6595
                  Σ = 4,437                        Σ = 3,492.2922   Σ = 3,785.2488


Variable Y denotes the average number of children born to
a farmer of the previous generation and variable X stands
for the size of the farm. Assuming that the relationship
between variables X and Y is expressed by the formula

$$y = b x^{a},$$

after taking logarithms we get the linear expression:

$$z = Av + \gamma,$$

where $z = \log y$, $v = \log x$, $A = a$, $\gamma = \log b$.
It should be explained that the division of the sample $\omega$ into
the two subgroups $\omega_1$ and $\omega_2$ required for the two-point
method has been done in such a way as to make the frequencies
of these subgroups as close to one another as possible.
Because of the asymmetry of the distribution of variable
(X, Y), subgroup $\omega_1$ contains 4 classes of the frequency
distribution, and subgroup $\omega_2$ 8 classes.

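Below are the calculations connected with the determination
of the values of parameters $A$ and $\gamma$.

$$\bar{z}_{(2)} = \frac{3{,}785.2488}{4{,}437} = 0.8531, \qquad \bar{z}_{(1)} = \frac{3{,}090.9915}{4{,}068} = 0.7598,$$

$$\bar{v}_{(2)} = \frac{3{,}492.2922}{4{,}437} = 0.7871, \qquad \bar{v}_{(1)} = \frac{622.9586}{4{,}068} = 0.1531,$$

$$A = \frac{0.8531 - 0.7598}{0.7871 - 0.1531} = 0.147,$$

$$\gamma = 0.8531 - 0.147 \cdot 0.7871 = 0.7374.$$

Hence

$$a = 0.147, \qquad b = 5.46.$$

Therefore, the equation of the regression curve determined by
the two-point method is

$$y = 5.46\,x^{0.147}.$$
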
On Graph 10 a scatter diagram is shown with two regression
curves determined by the classical method (broken line) and
the two-point method (continuous line). It can be seen from
the graph that both curves are good representations of the
distribution of points on the graph. However, the determination
of the regression curve by the two-point method is
much easier and, therefore, this method turns out to be more
useful in this example.
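
The computation for Example 2 can be sketched with frequency-weighted means. The sketch rests on assumptions drawn from the worked numbers: the classes are split after the fourth class, the group means are weighted by the class frequencies, and the line is taken through the means of the second subgroup. Logarithms are recomputed here rather than read from the table, so the results may differ from the printed ones in the last digit. Function names are hypothetical.

```python
import math

farm_size = [0.25, 0.75, 1.50, 2.50, 3.50, 4.50, 6.00, 8.50, 12.50, 17.50, 25.00, 40.00]
children = [4.88, 5.20, 5.79, 6.16, 6.57, 6.83, 7.00, 7.67, 7.90, 8.59, 8.66, 9.11]
frequency = [317, 658, 1509, 1584, 836, 961, 1319, 620, 509, 139, 44, 9]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def two_point_power_fit(xs, ys, ns, split):
    """Fit y = b * x**a on log10 scales from two frequency-weighted group means."""
    v = [math.log10(x) for x in xs]
    z = [math.log10(y) for y in ys]
    v1 = weighted_mean(v[:split], ns[:split])
    z1 = weighted_mean(z[:split], ns[:split])
    v2 = weighted_mean(v[split:], ns[split:])
    z2 = weighted_mean(z[split:], ns[split:])
    slope = (z2 - z1) / (v2 - v1)  # exponent a
    intercept = z2 - slope * v2    # log b
    return slope, 10.0 ** intercept


a, b = two_point_power_fit(farm_size, children, frequency, split=4)
print(round(a, 3), round(b, 2))
```

Because the split point is passed in explicitly, the same function can be used to see how sensitive the fitted curve is to the way the sample is divided.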



6. THE REGRESSION LINE AND THE TREND 

6.1. The definition of trend 1 

Let us denote by $\Omega_t$ a collection of items composing a
general population. To each item of this collection a value of the
random variable $X_t$ is assigned; it is known that its distribution
depends on the time t. The relationship is such that

$$E(X_t \mid t) = \psi(t), \tag{1}$$

where $\psi(t)$ is a function determined for at least those values
of t that satisfy the inequality $0 \le t \le m$.

In the special case when $\psi(t)$ is linear, formula (1) assumes
the following form:

$$E(X_t \mid t) = \alpha t + \beta, \tag{2}$$

where $\alpha$ and $\beta$ are constants.

Let us denote by r and s two moments of time of which
we know that $0 \le r < s \le m$. Points r and s determine a
certain time interval whose length is $T = s - r$. Let us divide
the length of time T into n+1 parts by points t = 1, 2, ..., n.
These parts we shall call segments. At every moment of time
t (t = 1, 2, ..., n) we draw from population $\Omega_t$, and return to
it, $k_t \ge 1$ items, examine which values of the random variable $X_t$
correspond to these items and calculate the arithmetic mean
of these values. In this way we get n pairs of numbers $(1, \bar{x}_1)$,
$(2, \bar{x}_2)$, ..., $(n, \bar{x}_n)$. These numbers constitute a time series. If
the function $\psi(t)$ is linear it is to be expected that if we represent
the whole time series on the graph the time curve will also
display a linear tendency.

1 Published in Przegląd Statystyczny (Statistical Review), No.
3/4, 1958.

The above model of a random process dependent upon 
time enables us to formulate the following definition: 

Definition 1. The function $\psi(t)$ is trend I of the random
variable $X_t$ depending upon time (see also [15], [27], [32]).

Every time series can, of course, be regarded as a realization 
of the random variable X t corresponding to items from the 
hypothetical general population drawn for the sample at 
moments of time 1,2, ..., n. 

The advantage of the above definition of trend is that it
does not use any non-statistical concepts such as "law"
or "tendency" but employs only notions having a definite
meaning in statistics. For this reason this definition is not
subject to any reservation of a formal nature.

In connection with our remarks concerning a correct defi- 
nition of trend, one important problem requires explan- 
ation. Before we describe its nature let us discuss an example 
of a time series. Suppose that we are conducting statistical 
research on the dynamics of the average sugar-beet yield per 
hectare on individually owned peasant farms in the whole 
country. Every year we draw a sample from the total number 
of farms and on the basis of the data on sugar-beet crops 
obtained from this sample we calculate the average sugar- 
beet yield on the farms drawn for the sample. In this way we 
get a time series. The function $\psi(t)$ in this example is a function
assigning average sugar-beet yields in the whole country to 
consecutive years. 

This function is not known. We can estimate it on the 
basis of the data from the time series. It is not easy, however, 
because: 

1) statistical research can be conducted only once a year 
after harvests. Therefore, the flow of statistical data is 
very slow; 
2) the function $\psi(t)$ is certainly not an expression of the
functioning of some law and its graphical presentation will
not be a smooth, "nice" curve. On the contrary, it can
be expected that the curve will be irregular, "whimsical",
will have bends and twists.

We are now going to formulate the problem mentioned
before. It consists in finding a way of estimating the function
$\psi(t)$ on the basis of a time series, if we know that its curve
may have a completely irregular shape. If we reject the
assumption, which is difficult to accept, that every time series
is governed by some law, and if we define the trend as we did
above, then we have to recognize that the curve can have any
shape, which means that it may also depend on random
factors. A line determined on the basis of time series data is
an estimate of the shape of the curve $\psi(t)$. The time series
may be interpreted as the data from a sample taken from
a population $\Omega_t$ which changes with time. When the time
series is interpreted in this way we can speak not only of a
trend in the population but also of a trend in the sample, which
can be considered as an independent population. The trend
in the sample is, of course, the time series curve $\bar{x}_1, \bar{x}_2, ..., \bar{x}_n$.
Let us denote the trend in the sample by $A(t)$. When the whole
population is analysed, i.e. when the sample is identical with
the population, then $\psi(t) = A(t)$.

The trend of the sample is an estimate of the trend of the
population. Let $\omega_t$ be a sample composed of k items drawn
from population $\Omega_t$ at the moment t. Because of our assumption
that items for the sample are drawn and then returned,
the conditions of the theorem that the sequence of arithmetic
means of the sample converges stochastically to the arithmetic
mean of the population are satisfied.

On the basis of this theorem we can state that

$$P\{|\bar{x}_t - \psi(t)| < \varepsilon\} \xrightarrow[k \to \infty]{} 1. \tag{3}$$

The above formula explains how to estimate a trend of a
population by a trend of a sample. Note that all considerations
concerning the random variable $X_t$, which depends on time, apply
exclusively to the interval $0 \le t \le m$.

Let us now consider the situation that $\Omega_t$ contains only
one item at a given moment of time t = 1, 2, ..., n, and the
situation that the number of items in the sample is, at every
moment of time, equal to the number of items in the population.
In both these situations, of course, $\psi(t) = A(t)$. In practice
it seldom happens that the trend line is expressed by a
simple, uncomplicated function. Hence, when we know from
experience the values of the function $\psi(t)$, and this is so in both
cases mentioned, the function $\psi(t)$ can be replaced by another
function $\theta(t)$ which will be represented on the graph by a
smooth curve, free of the irregular breaks in the curve
representing $\psi(t)$. We shall call the function $\theta(t)$ trend II. Let us
define this term. Let $\theta(t)$ be a function of variable t,
determined for $r \le t \le s$, and let $H_t$ be the statistical hypothesis
that the values of the time series $x_1, x_2, ..., x_t, ...$ at the moments
of time $t \in T$ can differ at most at random from the corresponding
values of the function $\theta(t)$.

Definition 2. The set R of functions determined for $t \in T$ and
such that, for a certain $\alpha$ satisfying the condition $0 < \alpha < 1$,
the hypothesis $H_t$ cannot be rejected, is called trend II of the
time series $x_1, x_2, ..., x_t, ...$

It follows from this definition that any function for which
the deviations of the time series are of a random nature is
a trend II. In particular a trend II is a function represented
by a broken line passing through all points $(t, x_t)$.

Definition 2 poses the problem of the choice of the function
$\theta(t)$. It might appear at first glance that from among the
functions $\theta(t)$ belonging to R we should select the function
for which the sum of absolute deviations of the values of the
time series from the values of $\theta(t)$ is a minimum. However,
this is not so.

The minimum sum of absolute deviations will belong to the
broken line passing through all points $(1, x_1)$, $(2, x_2)$, ..., $(t, x_t)$, ....
This sum, of course, will equal zero. However, the following
considerations are against the choice of this function. The
notion of the trend has been introduced into the analysis of
time series for two reasons:

1) because, on the assumption of a status quo, the trend
line enables us to predict the future development of the
phenomenon studied;

2) because it permits us to describe functionally the devel- 
opment of this phenomenon in the past; this descrip- 
tion plays an important role in all cases in which it is 
necessary to eliminate a tendency from the time series 1 . 

It follows from point 1) that the more the scientific pre- 
dictions based on the trend line extend into the future, the 
more valuable they are in practice. From the formal point of 
view such predictions are an ordinary extrapolation of the 
trend line. Naturally, objectively justified extrapolation is 
possible only for simple functions which are represented 
graphically by continuous curves not having many fluctua- 
tions and breaks. If we were to extrapolate a curve having 
an irregular, complicated shape, we would be unable to pro- 
vide a sufficiently convincing explanation for a bend or break 
in the extrapolated part of the curve. This leads to the neces- 
sity of selecting only the simplest functions from among those 
belonging to set R. There is a contradiction between these
two criteria for choosing the function $\theta(t)$ from set R: that the
shape of the function $\theta(t)$ be as simple as possible and that the
sum of absolute deviations of the values of the time series
from the values of the function $\theta(t)$ be as small as possible. It is
the author's opinion that in selecting the function $\theta(t)$ we have
first to consider the requirement that the shape of the function
be as simple as possible, and then the requirement that the
sum of the deviations be a minimum.

1 Such a necessity is most likely to appear when the problems studied
deal with stationary stochastic processes and in particular with the theory
of correlation of stationary random variables.

In Definition 2 it is stated that every function $\theta(t)$ can
be a trend II line if the deviations of the values of the time
series from the values of the function are random. An analysis
of whether a function belongs to set R consists in the
verification of hypothesis $H_t$. There are many tests that can be
used to verify hypothesis $H_t$. We shall discuss here only
the most useful.

6.2. Some tests for verifying hypothesis H_t

6.2.1. The run test 

The run test (see 4.2.2.) can be used to verify hypothesis
$H_t$. We shall demonstrate this by an example.

Example 1. Table 1 contains monthly operating data from 
the Wroclaw Transport Corporation. The data cover a period 
of two years and are expressed in thousands of car-kilo- 
metres. 

TABLE 1
CAR-KILOMETRES OPERATED IN WROCLAW

No   Car-km   |  No   Car-km   |  No   Car-km   |  No   Car-km
     thous.   |       thous.   |       thous.   |       thous.
 1    1,163   |   7    1,201   |  13    1,326   |  19    1,498
 2    1,034   |   8    1,215   |  14    1,221   |  20    1,504
 3    1,094   |   9    1,191   |  15    1,372   |  21    1,479
 4    1,080   |  10    1,242   |  16    1,302   |  22    1,576
 5    1,210   |  11    1,213   |  17    1,401   |  23    1,598
 6    1,127   |  12    1,252   |  18    1,495   |  24    1,617

Total for Nos 1-12:   14,022
Total for Nos 13-24:  17,389

The time curve based on these data is shown on Graph 1.
This curve displays a clearly marked growth tendency.



GRAPH 1. 
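A linear function has been used to express this tendency:

$$x = at + b,$$

in which

$$a = \frac{\bar{x}_{(2)} - \bar{x}_{(1)}}{\bar{t}_{(2)} - \bar{t}_{(1)}}, \tag{1}$$

where $\bar{x}_{(1)}$ and $\bar{t}_{(1)}$ are the arithmetic means of $x_t$ and t
over the moments $t \le l$, $\bar{x}_{(2)}$ and $\bar{t}_{(2)}$ are the corresponding
means over the moments $t \ge l+1$, and l is the integer satisfying

$$l \le \bar{t} < l+1,$$

and

$$b = \bar{x} - a\bar{t}, \tag{2}$$

whereas

$$\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t, \qquad \bar{t} = \frac{1}{n}\sum_{t=1}^{n} t.$$

When n is an even number (in practice this condition can
always be satisfied) then

$$a = \frac{4}{n^2}\left(\sum_{t=n/2+1}^{n} x_t - \sum_{t=1}^{n/2} x_t\right). \tag{3}$$
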

In our example we should check whether the straight line
expressed by the equation x = at + b is a trend II line.

The parameters a and b of this line are determined by (2)
and (3). We have

$$a = \frac{4}{24^2}(17{,}389 - 14{,}022) = \frac{3{,}367}{144} = 23.4,$$

$$\bar{t} = 12.5, \qquad \bar{x} = \frac{14{,}022 + 17{,}389}{24} = 1{,}309,$$

$$b = 1{,}309 - 23.4 \cdot 12.5 = 1{,}017.$$

We would like to draw the reader's attention to the simplicity
of the computations involved in the determination of
parameters a and b by formulae (2) and (3).

These computations are much simpler than those required 
for the method of least squares. The method of determining 
the parameters of the trend line proposed by the author is 
a special case of the determination of the regression para- 
meters by the two-point method 1 . 

Let us denote by A an event in which the point with
coordinates $(t, x_t)$ is located above the trend line x = at + b, and
by B an event in which such a point is located below the trend
line. We do not take into account the points located exactly
on the trend line. In practice the last situation has no chance
of occurring since its probability is zero. As we can see from
the calculations and from Graph 1, in our example the
following sequence of events occurred:

ABABABAABBBBABABBAAABAAA.

The maximum length of run in this sequence is k = 4. It
follows from Table 1, 4.2.1., that for n = 24 the value of k
required to reject hypothesis $H_t$ at the level of significance
$\alpha = 0.05$ would have to be at least 8. This means that there
are no grounds for rejecting the hypothesis that the values
of the time series deviate from the line x = 23.4t + 1,017
only at random. In accordance with Definition 2 the line with
this equation is a trend II line of the time series under
consideration.

1 The usefulness of this method for determining trends was
indicated to the author by J. Oderfeld.
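
The whole example can be retraced in code: the trend parameters from the sums of the two halves of the series, the sequence of events A and B, and the longest run. This is an illustrative sketch with hypothetical function names; it uses the unrounded parameters, and a point lying exactly on the line (which does not occur here) would be counted as B.

```python
car_km = [1163, 1034, 1094, 1080, 1210, 1127, 1201, 1215, 1191, 1242, 1213, 1252,
          1326, 1221, 1372, 1302, 1401, 1495, 1498, 1504, 1479, 1576, 1598, 1617]


def two_half_trend(xs):
    """Parameters of x = a*t + b for t = 1..n with n even, from the sums of the two halves."""
    n = len(xs)
    half = n // 2
    a = 4.0 * (sum(xs[half:]) - sum(xs[:half])) / n ** 2
    t_bar = (n + 1) / 2.0
    x_bar = sum(xs) / n
    return a, x_bar - a * t_bar


def sign_sequence(xs, a, b):
    """'A' for a point above the trend line, 'B' otherwise."""
    return "".join("A" if x > a * t + b else "B" for t, x in enumerate(xs, start=1))


def longest_run(sequence):
    best = current = 1
    for previous, following in zip(sequence, sequence[1:]):
        current = current + 1 if previous == following else 1
        best = max(best, current)
    return best


a, b = two_half_trend(car_km)
sequence = sign_sequence(car_km, a, b)
print(round(a, 1), round(b), sequence, longest_run(sequence))
```

The critical run length itself is not computed here; it has to be taken from the table referred to in the text.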



6.2.2. The χ² test

The χ² test can be used to verify hypothesis $H_t$. Let us
assume that if the deviations of the values of the time series
from the line x = at + b are random, then the probability
of a positive deviation equals the probability of a negative
deviation, so

$$P(A) = P(B) = 1/2.$$

Let us denote by r the number of events A. In this case, for a
sufficiently large sample, the distribution of the random
variable

$$\chi^2 = \frac{(r - n/2)^2}{n/2} + \frac{(n - r - n/2)^2}{n/2} = \frac{(2r - n)^2}{n}$$

approximates the χ² distribution with one degree of freedom.
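
As an illustration, the test can be applied to the sequence of events obtained in the run-test example. The sketch assumes the usual goodness-of-fit form of the statistic, with expected frequency n/2 for each sign, and the standard 5 per cent critical value 3.841 for one degree of freedom; the function name is hypothetical.

```python
def sign_chi_square(r, n):
    """Chi-square statistic for r events A among n, under P(A) = P(B) = 1/2."""
    expected = n / 2.0
    return (r - expected) ** 2 / expected + ((n - r) - expected) ** 2 / expected


sequence = "ABABABAABBBBABABBAAABAAA"  # sequence of events from the run-test example
r = sequence.count("A")
n = len(sequence)
chi2 = sign_chi_square(r, n)
critical = 3.841  # chi-square, one degree of freedom, significance level 0.05
print(r, n, round(chi2, 4), chi2 > critical)
```

With 13 positive deviations out of 24 the statistic is far below the critical value, so this test gives no grounds for rejecting the hypothesis either.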
6.2.3. Pitman's test

Both tests described above are very simple to use. Their 
drawback is that they react only to the sign of the deviations 
of the values of the time series from the trend line and not 
to the magnitude of these deviations. Pitman's test is free of 
this drawback. We shall now discuss this test (see [33], p. 
128-131). 

Suppose that we have two samples $\omega_1$ and $\omega_2$ with
frequencies m and r respectively. In sample $\omega_1$ the sequence of
values $y_1, y_2, ..., y_m$ has been obtained, and in sample $\omega_2$ the
sequence of values $z_1, z_2, ..., z_r$. Let us define

$$\bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i, \qquad \bar{z} = \frac{1}{r}\sum_{j=1}^{r} z_j, \qquad v = \frac{1}{m+r}\left(\sum_{i=1}^{m} y_i + \sum_{j=1}^{r} z_j\right). \tag{1}$$

The number of combinations of m+r items taken m at a time
equals $N = C_{m+r}^{m}$. This is the number of ways a set of m+r
items can be divided into two subgroups numbering m and r
items respectively. Samples $\omega_1$ and $\omega_2$ form one such division
into subgroups of m and r items.

Let us denote by M the number of such divisions which
have a certain property W distinguishing them from the
remaining N - M divisions. In this case the probability of the
occurrence of a division with property W is equal to the
fraction M/N. Let us introduce the quantity

$$R = |\bar{y} - \bar{z}|, \tag{2}$$

which we shall call the range of the division, or briefly
the range. Let property W consist of $R \ge R_0$ where $R_0$ is a
certain positive number. Let us select $R_0$ so that

$$M/N \le \alpha, \quad \text{i.e.} \quad M \le N\alpha, \tag{3}$$

where $\alpha$ is a real number satisfying the condition $0 < \alpha < 1$.
We shall consider that samples $\omega_1$ and $\omega_2$ come from the same
population, i.e. that they can differ from one another only at
random, if the corresponding range $R < R_0$. Otherwise we
shall assume that $\omega_1$ and $\omega_2$ come from two different
populations.

Let us denote by $y_i$ (i = 1, 2, ..., m) the positive deviations
of the values of the series from the probable trend line, i.e.

$$y_i = [x_t - \theta(t)] > 0, \tag{4}$$

and by $z_j$ (j = 1, 2, ..., r) the negative deviations of this
series, i.e.

$$-z_j = [x_t - \theta(t)] < 0. \tag{5}$$

The computations connected with the verification of hy- 
pothesis H t by Pitman's test will be explained on a numerical 
example. 

Example 1. The average employment in Poland in
1949-1955 was as follows (see [65], p. 277):

Years                              1949   1950   1951   1952   1953   1954   1955
Employment in tens of thousands    43.5   51.6   56.3   58.9   62.7   65.2   67.6

Assuming $\alpha = 0.05$, check whether the line $x_t = 3.5t + 44$ may
be considered a trend II line.

The deviations of the time series from a line with this
equation are given in Table 1.
TABLE 1
EMPLOYMENT IN POLAND 1949-1955

Observed x_t   Trend value 3.5t + 44     y      z
    43.5              47.5               -     4.0
    51.6              51.0              0.6     -
    56.3              54.5              1.8     -
    58.9              58.0              0.9     -
    62.7              61.5              1.2     -
    65.2              65.0              0.2     -
    67.6              68.5               -     0.9


The total number of the combinations in our example equals

$$N = C_7^2 = 21.$$

Hence

$$M \le 0.05 \cdot 21 = 1.05, \quad \text{i.e.} \quad M = 1.$$

To arrange the combinations according to declining values
of R we shall use the formula

$$r\,|\bar{z} - v| = \left|\sum z - rv\right|. \tag{6}$$

In our example r = 2, $\sum z = 4.9$, v = 1.39. In this case

$$r\,|\bar{z} - v| = |4.9 - 2.78| = 2.12.$$

Here are 3 out of the 21 combinations for which the
corresponding pairs of numbers inserted into formula (6) give a
value not less than 2.12:

4.0   1.8
4.0   0.9
4.0   0.9.

Since to reject hypothesis $H_t$ the number of such combinations
may not be greater than M = 1, in our example there are
no grounds for rejecting hypothesis $H_t$.
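
For a sample this small the divisions can simply be enumerated. The sketch below is an illustration under stated assumptions, not the author's computing scheme: it works directly with the range R of formula (2) instead of formula (6), treats every choice of r = 2 of the 7 absolute deviations as one division, and counts the divisions whose range is at least the observed one (the observed division and ties included). Function and variable names are hypothetical.

```python
from itertools import combinations

positive_deviations = [0.6, 1.8, 0.9, 1.2, 0.2]  # y values from the table
negative_deviations = [4.0, 0.9]                 # z values from the table


def pitman_enumeration(ys, zs, tolerance=1e-9):
    """Count the divisions whose range |mean(y) - mean(z)| is at least the observed range."""
    pooled = ys + zs
    total = sum(pooled)
    m, r = len(ys), len(zs)

    def division_range(z_sum):
        return abs((total - z_sum) / m - z_sum / r)

    observed = division_range(sum(zs))
    n_divisions = 0
    n_extreme = 0
    for indices in combinations(range(m + r), r):
        n_divisions += 1
        if division_range(sum(pooled[i] for i in indices)) >= observed - tolerance:
            n_extreme += 1
    return n_divisions, n_extreme, observed


n_divisions, n_extreme, observed = pitman_enumeration(positive_deviations, negative_deviations)
limit = int(0.05 * n_divisions)  # largest admissible M for alpha = 0.05
print(n_divisions, n_extreme, round(observed, 2), n_extreme > limit)
```

Since the number of divisions at least as extreme as the observed one exceeds the admissible M = 1, the enumeration leads to the same conclusion as the text: the hypothesis is not rejected.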

Pitman's test is awkward to use because of the necessity 
of finding combinations with property W. The computations 
become cumbersome when the total number of combinations 
N is large. In such cases, however, we can use the random 
variable 

wk sn\ 



which has a distribution similar to Student's distribution with 
the number of degrees of freedom given by: 

$$k = m + r - 2,$$

where $w$ is expressed by the equation 

w = OL (5 ~~ *)* (8) 

r s 2 

in which $s$ denotes the standard deviation calculated on the 
basis of the data from both samples $\omega_1$ and $\omega_2$. 

6.3. The determination of trend ex post and ex ante 

Let us consider two examples, taken from real life, in which 
it is necessary to determine trends. 

Example 1. It is desired to study the relationship, in a 
certain enterprise, between the amount of production outlay 
Y and the volume of production X. The purpose of this 
research is to find the regression equation of Y on X. The 
knowledge of this equation is of great practical importance 
since it enables us to assign to a given volume of production 
the expected size of outlay. The question arises, however, 
whether production and outlay, or at least one of these quan- 
tities, are not correlated with time since the efforts of the em- 
ployees are constantly concentrated on increasing production 
and lowering costs, thus creating a regular factor which would 
explain such a correlation of variables X and Y with time. 
To answer this question we have to check whether variables 
X and Y show a time tendency. There are no difficulties as far 
as variable X, i.e. production, is concerned. The situation is 
more difficult with respect to variable Y, i.e. outlay, which 
may be correlated not only with time, but also with variable X. 
If variable Y is correlated with time then instead of the equation 
of the regression line we have to calculate the equation of the 
regression surface. When any of the variables shows a time tendency, 
then we always have to remember that it is not advisable to 
use the same regression equation for two periods which are 
far apart. As we can see, an analysis of tendencies plays an 
important part in proper research on the relationship be- 
tween outlay and production. 

Example 2. The workers in a mine have reported that they 
have extracted more coal than in the preceding month. They 
consider that this is an achievement deserving notice. It seems, 
however, that only those production effects can be regarded 
as achievements that are of a permanent nature. The workers 
of a mine can claim a worth-while achievement in increasing 
production only when the production trend is an increasing 
function of time. There is hardly any economic advantage 
to increasing production in one month in comparison with 
the preceding month if it drops considerably in the following 
month. Such fluctuations in production may be caused by 
random factors and they cannot constitute a basis for an 
appraisal or an economic decision. 

In the two examples given above the trend line was needed 
to appraise ex post the phenomenon studied. The conclusions 
reached on the basis of the trend line pertain to the past. 

Below is the procedure connected with the determination 
of the trend line needed for the analysis of a given phenom- 
enon in the past: 

1) the accumulation of statistical data for the period to 
be studied; 

2) the preparation of a time graph on the basis of accu- 
mulated statistical data; 




3) the determination by appropriate methods of the 
equation of the line which is to express the tendency 
of the time curve; 

4) the verification whether this line can be considered a trend 
line. 

If the determination of the trend line follows the order 
mentioned above then we shall regard this line as determined 
ex post. 

The situation is different when the trend line is determined 
currently as the statistical data become available, and when 
the trend appears before the statistical analysis is finished, 
and when it is used not so much for the appraisal of a given 
phenomenon in the past as for predicting its behaviour in the 
future. In this case the procedure involved in the determina- 
tion of the trend is as follows: 

1) the selection of the significance level $\alpha$; 

2) the determination of the minimum value of $n$ which 
enables us to reject hypothesis $H_t$ at the level $\alpha$ (note: 
when the two-point method is used for the determination 
of the parameters of the trend, then $n$ has to be an 
even number). If we verify hypothesis $H_t$ by a run test, 
$n = 10$ when $\alpha = 0.05$ (see Table 1, 4.2.2.); 

3) on the basis of $n - 1$ consecutive points of the time 
curve¹ the equation of the straight line $\hat{x}_t^{(1)} = a_1 t + b_1$ 
is determined and the hypothesis $H_t^{(1)}$ formulated that 
in the interval $[1, n]$ the equation of the trend line is 
$\hat{x}_t^{(1)} = a_1 t + b_1$. If the hypothesis is not rejected we 
formulate the hypothesis $H_t^{(2)}$ that this equation 
will be a trend line equation in the interval $[1, n+1]$. 
We continue this procedure until there appear grounds 
for rejecting hypothesis $H_t^{(r)}$, where $r$ is the number of 
the hypothesis at the time of rejecting it; 

¹ or $n - 2$ when the two-point method is used. 

4) after rejecting hypothesis $H_t^{(r)}$ the new line $\hat{x}_t^{(r)} = a_r t + b_r$ 
is determined on the basis of the $n - 1$ last points of the time 
curve and a new hypothesis is formulated; it is considered 
that since none of the remaining points provides grounds 
for rejecting hypothesis $H_t^{(r-1)}$ then the equation 
$\hat{x}_t^{(r-1)} = a_{r-1} t + b_{r-1}$ is the equation of trend II of these points. 
The procedure is then repeated according to the instructions 
in points 3 and 4. 

This procedure prevents us from recognizing as a trend line 
a curve which does not satisfy the condition of random devia- 
tions formulated in Definition 2. On the other hand this pro- 
cedure enables us to determine the trend line currently, with- 
out waiting until "the law governing the time series" emerges. 
The trend line obtained in this way we shall call the ex 
ante line since it is determined before the statistical analysis 
is finished. The procedure involved in the determination 
of the ex ante line is a sequential procedure. 

The ex ante trend line is composed of different straight 
line segments following each other, and the equation of this 
line is written as a sequence of linear functions corresponding 
to these straight lines. 
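
The sequential procedure can be sketched in code. This is a simplified illustration under stated assumptions, not the procedure of the text in every detail: the straight line is fitted by least squares instead of the two-point method, the window length `window` plays the role of the minimum sample size, and the verification step is a caller-supplied predicate `rejects`. The rule `same_sign_tail` is a hypothetical stand-in for the run test of 4.2.2, and the data are artificial.

```python
from typing import Callable, List, Sequence, Tuple

def fit_line(ts: Sequence[float], xs: Sequence[float]) -> Tuple[float, float]:
    """Least-squares straight line x = a*t + b."""
    n = len(ts)
    t_mean = sum(ts) / n
    x_mean = sum(xs) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    a = sum((t - t_mean) * (x - x_mean) for t, x in zip(ts, xs)) / sxx
    return a, x_mean - a * t_mean

def sequential_trend(xs: Sequence[float], window: int,
                     rejects: Callable[[List[float]], bool]
                     ) -> List[Tuple[int, int, float, float]]:
    """Return segments (first t, last t, a, b); t is counted from 1."""
    segments = []
    start = 0                                  # 0-based index of the window start
    while start + window <= len(xs):
        ts = list(range(start + 1, start + window + 1))
        a, b = fit_line(ts, xs[start:start + window])
        end = start + window                   # index of the next point to admit
        while end < len(xs):
            residuals = [xs[i] - (a * (i + 1) + b) for i in range(start, end + 1)]
            if rejects(residuals):             # grounds for rejecting the line
                break
            end += 1
        segments.append((start + 1, end, a, b))
        if end == len(xs):
            break
        start = end - window + 1               # refit on the last `window` points
    return segments

def same_sign_tail(residuals: List[float], k: int = 3) -> bool:
    """Hypothetical rule: reject when the last k deviations share one sign."""
    tail = residuals[-k:]
    return all(r > 0 for r in tail) or all(r < 0 for r in tail)

# Artificial series: slope 1 up to t = 10, slope 5 afterwards, alternating noise.
noise = [0.5 if i % 2 == 0 else -0.5 for i in range(20)]
xs = [(t if t <= 10 else 10 + 5 * (t - 10)) + noise[t - 1] for t in range(1, 21)]
segments = sequential_trend(xs, 6, same_sign_tail)
```

Each tuple in `segments` describes one straight-line piece of the ex ante line; on the artificial series the first piece ends at the point where the slope changes.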

This is a little troublesome, but it should be remembered 
that after the accumulation of sufficient statistical data, the 
ex ante trend line can always be replaced by the ex post trend 
line. 

Example 3. Table 1 contains the data on the monthly 
production of automobiles in the United States in 1905-1928. 
The time curve based on the data from Table 1 is shown 
on Graph 1. It is a broken line shown as a thick line on the 
Graph. The thin broken line composed of two segments is 
a trend II line determined by the sequential procedure. The 
equation of the first segment of the trend line is $\hat{x}_t = 3.8t - 5$. 
The straight line of this equation is a trend line for 1905-1912. 
The trend for the following years is the line $\hat{x}_t = 22.5t - 168$. 
TABLE 1 

MONTHLY AUTOMOBILE PRODUCTION 
IN THE USA 1905-1928 

  t    Years    x_t in thous./month 
  1    1905       2.1 
  2    1906       2.8 
  3    1907       3.7 
  4    1908       5.4 
  5    1909      10.9 
  6    1910      15.6 
  7    1911      17.5 
  8    1912      31.5 
  9    1913      40.4 
 10    1914      47.4 
 11    1915      80.8 
 12    1916     134.8 
 13    1917     156.2 
 14    1918      97.6 
 15    1919     161.1 
 16    1920     185.6 
 17    1921     134.7 
 18    1922     212.0 
 19    1923     336.2 
 20    1924     300.2 
 21    1925     355.5 
 22    1926     358.4 
 23    1927     283.4 
 24    1928     363.2 

(see [35], pp. 193-194). 

The dotted line shown on Graph 1 is the ex post trend line. 
It is a logistic curve with the equation 

_ 320-83 

1 I _J_ gl'4925 ^ e - 0-1569 t ' 

To determine the equation of this curve the data for 1903- 
1941 were used. The statistical data for 1929-1941 are not 
shown in Table 1 because the author believes that the deviations 
from the trend line can only be of a random nature. 
Two events which took place in the period 1929-1941 have 
made it impossible to analyse the majority of economic phenomena 
by trend methods. These events were the economic 
crisis and war. 

GRAPH 1. 
As can be seen from Graph 1, the broken line with the 
equation 

$$\hat{x}_t = \begin{cases} 3.8t - 5, & 1 \le t \le 8, \\ 22.5t - 168, & 9 \le t \le 24, \end{cases} \qquad (1)$$


and the continuous line with the equation 
320-83 . 



*.= 



-0 1569* 



24 



(2) 



have approximately the same shape. Both lines can be regarded 
as trend II lines because, according to Definition 2 
in 6.1., these lines are equivalent since they belong to set R. 
This does not mean that it is a matter of indifference which of 
these lines should be considered as more useful in practical 
applications. The great advantage of line (1) is the simplicity 
of the computations involved in the determination of its parameters 
and the lack of any indication that there exists a law 
governing the stochastic process under consideration. Its 
drawback is that it is a broken line composed of straight line 
segments. The advantage of the line with equation (2) is that 
it is a continuous line without breaks and can be expressed 
analytically as one function. Its disadvantages are the more 
difficult computations involved in the determination of the 
parameters of its equation and a temptation to interpret the 
equation of the line as a law expressed in mathematical 
language, governing the development in time of the phenomenon 
studied. According to the author, both lines can be 
of service in the analysis of time series providing the notion 
of the trend is properly interpreted. 
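
To see how the broken line behaves against the data, the two segment equations quoted in Example 3 can be evaluated for every year of Table 1. The sketch below only computes the deviations and counts their signs; the function name and the variables are illustrative and not part of the original text.

```python
# Monthly automobile production (thousands per month) from Table 1; t = 1 is 1905.
production = [2.1, 2.8, 3.7, 5.4, 10.9, 15.6, 17.5, 31.5,
              40.4, 47.4, 80.8, 134.8, 156.2, 97.6, 161.1, 185.6,
              134.7, 212.0, 336.2, 300.2, 355.5, 358.4, 283.4, 363.2]

def broken_line_trend(t: int) -> float:
    """Two straight-line segments: 3.8t - 5 up to t = 8, 22.5t - 168 afterwards."""
    if 1 <= t <= 8:
        return 3.8 * t - 5
    if 9 <= t <= 24:
        return 22.5 * t - 168
    raise ValueError("the broken line is stated only for 1 <= t <= 24")

# Deviations of the observed series from the broken line.
deviations = [x - broken_line_trend(t) for t, x in enumerate(production, start=1)]

positive = sum(1 for d in deviations if d > 0)   # points above the line
negative = sum(1 for d in deviations if d < 0)   # points below the line
```

The list `deviations` is the raw material for any test of randomness, such as a run test or Pitman's test, applied to the trend line.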

The definition of the trend proposed in this work has im- 
portant practical implications: 

a) it introduces the concept of the set R of functions which 
can be considered as trend functions. In the interpreta- 
tion hitherto prevailing each function could be a trend 
function; 

b) it simplifies computations involved in the determina- 
tion of the parameters of the trend line; 

c) it makes possible the discovery and recognition of 
regular fluctuations (seasonal or cyclical) in the time 
series, if such fluctuations exist; they will be shown 
by a broken line determined by the sequential procedure; 

d) it enables us to reduce the random variable X t to the form 
in which this variable is not dependent upon time. This 
transformation is accomplished by the formula 



The determination of the trend is of great practical importance 
to economic research because the correct knowledge 
of economic processes is possible only when these processes 
are interpreted dynamically. Statistical methods of determining 
trends are part of a branch of mathematical statistics 
known as the theory of stochastic processes which has developed 
rapidly in recent years. 

It is difficult to take time into account in scientific research 
and it is not surprising that the more important achievements 
of statistics in this field are only a matter of recent years. 
As we know, correlation analysis is one of the main research 
tools used in the theory of stochastic processes. Thus new 
and broad fields for applications are opening up before the 
correlation and regression methods. This leads us to believe 
that correlation and regression theory will be studied with 
interest and, in consequence, will be further developed. 



APPENDIX 

PROOFS OF THEOREMS AND STATISTICAL DATA 
USED IN THE BOOK 

Proof of Theorem 1 from 3.3.1. 
The proof can be written as follows: 

(i) x, y ( i) y ^ 
To prove the theorem we have to show that: 



In the proof we shall use the following 
Lemma 1. 

V* ""V fl If 

A/o\ ^^ A fir v 



A' n 



Proof of the Lemma: 



1 Y* l 

> X ~ 



n-k 
Similarly we can prove 
Lemma 2. 



y* y k 

The correctness of the theorem follows directly from Lemmas 
1 and 2. 

Note: a similar proof can be given for the more general 
theorem that points $(\bar{x}_{(1)}, \bar{y}_{(1)})$, $(\bar{x}_{(2)}, \bar{y}_{(2)})$, $(\bar{x}, \bar{y})$ are located 
on the same straight line. In this theorem the set $\omega$ is 
arbitrarily divided into two subgroups, e.g. by means of 
the number $x_l$ satisfying the inequality 

$$x_{\min} \le x_l \le x_{\max},$$

where $x_{\min}$ denotes the smallest abscissa and $x_{\max}$ the greatest 
abscissa of the points belonging to $\omega$. 
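
The collinearity stated in the note can be checked numerically. The sketch below is an illustration only: the points and the dividing number `x_split` are chosen arbitrarily, and the check relies on the fact that the overall centroid is a weighted average of the two subgroup centroids.

```python
def centroid(points):
    """Arithmetic mean of the abscissae and of the ordinates."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

# Illustrative points; any set with at least one point on each side will do.
points = [(12, 27), (56, 25), (65, 31), (114, 34), (137, 38), (110, 39), (129, 42)]
x_split = 100   # arbitrary number between the smallest and greatest abscissa

group1 = [p for p in points if p[0] <= x_split]
group2 = [p for p in points if p[0] > x_split]

c = centroid(points)
c1 = centroid(group1)
c2 = centroid(group2)

# The cross product of (c1 - c) and (c2 - c) vanishes when the points are collinear.
cross = (c1[0] - c[0]) * (c2[1] - c[1]) - (c1[1] - c[1]) * (c2[0] - c[0])
```

Up to rounding error `cross` equals zero for every admissible choice of `x_split`, which is the content of the more general theorem.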



COMPUTATION TABLE FOR EXAMPLE 1 FROM 4.1.2. 

 No     x     y   x-w   y-u    (x-w)²    (y-u)²   (x-w)(y-u) 
  1   183   175    -7    -5        49        25          35 
  2   184   172    -6    -8        36        64          48 
  3   180   168   -10   -12       100       144         120 
  4   164   156   -26   -24       676       576         624 
  5   177   190   -13    10       169       100        -130 
  6   159   160   -31   -20       961       400         620 
  7   147   142   -43   -38     1,849     1,444       1,634 
  8   151   153   -39   -27     1,521       729       1,053 
  9   164   149   -26   -31       676       961         806 
 10   122   128   -68   -52     4,624     2,706       3,536 
 11   167   167   -23   -13       529       169         299 
 12   188   172    -2    -8         4        64          16 
 13   180   162   -10   -18       100       324         180 
 14   156   160   -34   -20     1,156       400         680 
 15   163   154   -27   -26       729       676         702 
 16   175   160   -15   -20       225       400         300 
 17   173   179   -17    -1       289         1          17 
 18   158   144   -32   -36     1,024     1,296       1,152 
 19   190   193     0    13         0       169           0 
 20   180   169   -10   -11       100       121         110 
 21   196   177     6    -3        36         9         -18 
 22   206   190    16    10       256       100         160 
 23   199   186     9     6        81        36          54 
 24   201   180    11     0       121         0           0 
 25   207   182    17     2       289         4          34 
 26   209   190    19    10       361       100         190 
 27   184   164    -6   -16        36       256          96 
 28   165   164   -25   -16       625       256         400 
 29   142   149   -48   -31     2,304       961       1,488 
 30   116   133   -74   -47     5,476     2,209       3,478 
 31   147   164   -43   -16     1,849       256         688 
 32   175   168   -15   -12       225       144         180 
 33   197   176     7    -4        49        16         -28 
 34   202   186    12     6       144        36          72 
 35   189   183    -1     3         1         9          -3 
 36   190   176     0    -4         0        16           0 
 37   180   191   -10    11       100       121        -110 
 38   170   167   -20   -13       400       169         260 
 39   182   161    -8   -19        64       361         152 
 40   189   180    -1     0         1         0           0 
 41   213   191    23    11       529       121         253 
 42   301   264   111    84    12,321     7,056       9,324 
 43   225   202    35    22     1,225       484         770 
 44   234   214    44    34     1,936     1,156       1,496 
 45   203   184    13     4       169        16          52 
 46   192   189     2     9         4        81          18 
 47   191   179     1    -1         1         1          -1 
 48   146   155   -44   -25     1,936       625       1,100 
 49   193   173     3    -7         9        49         -21 
 50   187   166    -3   -14         9       196          42 
 51   187   182    -3     2         9         4          -6 
 52   212   205    22    25       484       625         550 
 53   251   216    61    36     3,721     1,296       2,196 
 54   220   201    30    21       900       441         630 
 55   180   164   -10   -16       100       256         160 
 56   207   191    17    11       289       121         187 
 57   190   173     0    -7         0        49           0 
 58   185   174    -5    -6        25        36          30 
 59   186   170    -4   -10        16       100          40 
 60   181   166    -9   -14        81       196         126 
 61   192   179     2    -1         4         1          -2 
 62   203   191    13    11       169       121         143 
 63   277   247    87    67     7,569     4,489       5,829 
 64   299   257   109    77    11,881     5,929       8,393 
 65   215   206    25    26       625       676         650 
 66   200   188    10     8       100        64          80 
 67   192   167     2   -13         4       169         -26 
 68   187   183    -3     3         9         9          -9 
 69   194   190     4    10        16       100          40 
 70   194   182     4     2        16         4           8 
 71   190   178     0    -2         0         4           0 
 72   278   241    88    61     7,744     3,721       5,368 

Sums (positive and negative terms added separately): 
Σx = 13,712; Σy = 12,888; Σ(x-w) = 803 - 771; Σ(y-u) = 595 - 667; 
Σ(x-w)² = 79,136; Σ(y-u)² = 44,024; Σ(x-w)(y-u) = 56,669 - 354. 

$\bar{x} = 190.4$, $\bar{y} = 179$, $w = 190$, $u = 180$. 



COMPUTATION TABLE FOR EXAMPLE 1 FROM 4.2.2. 

 No     x     y   x-u   y-w    (x-u)²    (y-w)²   (x-u)(y-w) 
  1    12    27  -108   -13    11,664       169       1,404 
  2    56    25   -64   -15     4,096       225         960 
  3    65    31   -55    -9     3,025        81         495 
  4   114    34    -6    -6        36        36          36 
  5   137    38    17    -2       289         4         -34 
  6   110    39   -10    -1       100         1          10 
  7   129    42     9     2        81         4          18 
  8   141    46    21     5       441        25         105 
  9   133    45    13     5       169        25          65 
 10    94    38   -26    -2       676         4          52 
 11    73    34   -47    -6     2,209        36         282 
 12    85    34   -35    -6     1,225        36         210 
 13   131    41    11     1       121         1          11 
 14   145    41    25     1       625         1          25 
 15   195    47    75     7     5,625        49         525 
 16   200    47    80     7     6,400        49         560 
 17   193    58    73    18     5,329       324       1,314 
 18   197    53    77    13     5,929       169       1,001 
 19   136    48    16     8       256        64         128 
 20   104    46   -16     6       256        36         -96 
 21    88    41   -32     1     1,024         1         -32 
 22   107    41   -13     1       169         1         -13 

Sums (positive and negative terms added separately): 
Σx = 2,645; Σy = 896; Σ(x-u) = 417 - 412; Σ(y-w) = 75 - 60; 
Σ(x-u)² = 49,745; Σ(y-w)² = 1,341; Σ(x-u)(y-w) = 7,201 - 175. 

$\bar{x} = 120.2$, $\bar{y} = 40.7$, 
$u = 120$, $w = 40$, 
$\sum (x - \bar{x})(y - \bar{y}) \approx 7{,}026$, 
$\sum (x - \bar{x})^2 \approx 49{,}745$, 
$\sum (y - \bar{y})^2 \approx 1{,}341$. 
COMPUTATION TABLE FOR EXAMPLE 1 FROM 4.3. 

 No      x      y    x_(2)   y_(2) 
  1    267    141 
  2    254    159 
  3    249    112 
  4    344    152     344     152 
  5    246    119 
  6    411    207     411     207 
  7    217    114 
  8    219    118 
  9    359    152     359     152 
 10    378    150     378     150 
 11    256    135     256     135 
 12    406    160     406     160 
 13    258    117 
 14    213     84 
 15    345    129     345     129 
 16    273    164 
 17    251     76 
 18    225    126 
 19    254    149 
 20    194    113 
  Σ  5,619  2,677   2,499   1,085 

$\bar{x} = 280.95$, $\bar{y} = 133.85$, $\bar{x}_{(2)} = 357$, $\bar{y}_{(2)} = 155$. 

$$b_{20} = 133.85 - 0.28 \cdot 280.95 \approx 55.$$
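
The arithmetic behind the intercept can be retraced from the four means given under the table. The sketch below rests on an assumption that is not spelled out in this excerpt: that the two-point slope is the slope of the straight line through the overall centroid and the centroid of the upper subgroup. Under that assumption the quoted slope 0.28 and the intercept 55 are both reproduced.

```python
# Means quoted under the computation table for Example 1 from 4.3.
x_mean, y_mean = 280.95, 133.85      # all 20 points
x2_mean, y2_mean = 357.0, 155.0      # points listed in the subgroup columns

# Assumed two-point slope: line through the two centroids.
slope = (y2_mean - y_mean) / (x2_mean - x_mean)

a = round(slope, 2)                  # rounded as in the text
b = y_mean - a * x_mean              # intercept, using the rounded slope
```

Rounding the slope before computing the intercept matters here: with the unrounded slope the intercept would come out closer to 56.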
COMPUTATION TABLE FOR EXAMPLE 2 FROM 4.3. 

 No       x      y   x-u    y-w    (x-u)²     (y-w)²   (x-u)(y-w) 
  1      12    292  -108   -158    11,664     24,964       17,064 
  2      56    308   -64   -142     4,096     20,164        9,088 
  3      65    388   -55    -62     3,025      3,844        3,410 
  4     114    388    -6    -62        36      3,844          372 
  5     137    517    17     67       289      4,489        1,139 
  6     110    545   -10     95       100      9,025         -950 
  7     129    536     9     86        81      7,396          774 
  8     141    536    21     86       441      7,396        1,806 
  9     133    561    13    111       169     12,321        1,443 
 10      95    512   -25     62       625      3,844       -1,550 
 11      73    459   -47      9     2,209         81         -423 
 12      62    446   -58     -4     3,364         16          232 

Sums for rows 1-12 (positive and negative terms added separately): 
Σx = 1,127; Σy = 5,488; Σ(x-u) = 60 - 373; Σ(y-w) = 516 - 428; 
Σ(x-u)² = 26,099; Σ(y-w)² = 97,384; Σ(x-u)(y-w) = 35,328 - 2,923. 

 No       x      y   x-u    y-w    (x-u)²     (y-w)²   (x-u)(y-w) 
 13     195    414    75    -36     5,625      1,296       -2,700 
 14     200    463    80     13     6,400        169        1,040 
 15     193    448    73     -2     5,329          4         -146 
 16     197    449    77     -1     5,929          1          -77 
 17     136    435    16    -15       256        225         -240 
 18     104    373   -16    -77       256      5,929        1,232 
 19      88    361   -32    -89     1,024      7,921        2,848 
 20     107    366   -13    -84       169      7,056        1,092 

Sums for rows 13-20: 
Σx = 1,220; Σy = 3,309; Σ(x-u) = 321 - 61; Σ(y-w) = 13 - 304; 
Σ(x-u)² = 24,988; Σ(y-w)² = 22,601; Σ(x-u)(y-w) = 6,212 - 3,163. 

Sums for rows 1-20: 
Σx = 2,347; Σy = 8,797; Σ(x-u) = 381 - 434; Σ(y-w) = 529 - 732; 
Σ(x-u)² = 51,087; Σ(y-w)² = 119,985; Σ(x-u)(y-w) = 41,540 - 6,086. 
The determination of the parameters of line I. 

$\bar{x} = 117.4$, $\bar{y} = 439.9$, 
$u = 120$, $w = 450$, 
$\Delta_u = -2.6$, $\Delta_w = -10.1$, 

$$\sum_{i=1}^{20} (x_i - \bar{x})(y_i - \bar{y}) = 35{,}454 - 20\,(-2.6)(-10.1) = 34{,}929,$$

$$\sum_{i=1}^{20} (x_i - \bar{x})^2 = 51{,}087 - 20\,(-2.6)^2 = 50{,}952,$$

$$a_{21} = \frac{34{,}929}{50{,}952} = 0.69,$$

$$b_{20} = 439.9 - 0.69 \cdot 117.4 = 359.$$

The determination of the parameters of line II. 

$\bar{x} = 93.9$, $\bar{y} = 457.3$, 
$u = 120$, $w = 450$, 
$\Delta_u = -26.1$, $\Delta_w = 7.3$, 

$$\sum_{i=1}^{12} (x_i - \bar{x})(y_i - \bar{y}) = 32{,}405 - 12\,(-26.1)(7.3) = 34{,}691,$$

$$\sum_{i=1}^{12} (x_i - \bar{x})^2 = 26{,}099 - 12\,(-26.1)^2 = 17{,}953,$$

$$a_{21} = \frac{34{,}691}{17{,}953} = 1.93,$$

$$b_{20} = 457.3 - 1.93 \cdot 93.9 = 276.5.$$

The determination of the parameters of line III. 

$\bar{x} = 152.5$, $\bar{y} = 414$, 
$u = 120$, $w = 450$, 
$\Delta_u = 32.5$, $\Delta_w = -36$, 

$$\sum_{i=13}^{20} (x_i - \bar{x})(y_i - \bar{y}) = 3{,}049 - 8\,(32.5)(-36) = 12{,}409,$$

$$\sum_{i=13}^{20} (x_i - \bar{x})^2 = 24{,}988 - 8\,(32.5)^2 = 16{,}538,$$

$$a_{21} = \frac{12{,}409}{16{,}538} = 0.75,$$

$$b_{20} = 414 - 0.75 \cdot 152.5 = 300.$$
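
The computations for lines I and III follow one pattern: sums of deviations from the arbitrary constants u and w are corrected by n times the product of the shifts. A minimal sketch of that pattern is given below; the function name and argument names are illustrative, and the inputs are the rounded means and table sums quoted above.

```python
def shifted_regression(n, x_mean, y_mean, u, w, sum_sq_x, sum_cross):
    """Slope and intercept from sums taken about the arbitrary constants u and w."""
    du = x_mean - u                      # shift of the x origin
    dw = y_mean - w                      # shift of the y origin
    sxy = sum_cross - n * du * dw        # sum of (x - x_mean)(y - y_mean)
    sxx = sum_sq_x - n * du ** 2         # sum of (x - x_mean)^2
    a = sxy / sxx
    b = y_mean - a * x_mean
    return a, b

# Line I: all 20 points; the cross sum is 41,540 - 6,086.
line_I = shifted_regression(20, 117.4, 439.9, 120, 450, 51_087, 41_540 - 6_086)

# Line III: points 13-20; the cross sum is 6,212 - 3,163.
line_III = shifted_regression(8, 152.5, 414.0, 120, 450, 24_988, 6_212 - 3_163)
```

The sketch keeps the slope unrounded when computing the intercept, whereas the text rounds the slope first; for these two lines both routes give the same rounded intercept.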



LIST OF SYMBOLS 



Symbol 
a 

021 
012 

a 

021 class 
021 point 



Ac 

b 
C 



D 

Ar 

d 
D 



The meaning of the symbol 

the significance level 

the regression coefficient of Y on X in a pop- 

ulation 
the regression coefficient of X on Y in a pop- 

ulation 

the estimate of a regression coefficient 
the estimate a 21 obtained by the method of least 

squares 
the estimate cr al obtained by the two-point 

method 

the /~ th commodity 
; a constant term in the equation of the linear 

regression of Y on X in a population 
a constant term in the equation of the linear 

regression of X on Y in a population 
the estimate of the constant term 
consumption 

covariance in a population 
covariance in a sample 
set of events 



the class interval of X 
demand for commodity 



Page* 

147 
29 

29 
101 

138 

138 
67 

29 

29 
101 

62 

23 
102 

12 

139 

116 
67 



* Page on which the symbol was used for the first time. 



Symbol 



The meaning of the symbol 



E 1 
E* 
e 
e 



MX) 



f(y\x) 
<r(x,y) 



fft 

ffr 



k 
k 



increase in consumption 

the /-to residual 

energy produced 

energy used up 

non-negative constant 

the relative efficiency of the regression coefficient 
obtained by the two-point method 

the density function of a two-dimensional 
random variable 

marginal density 

two-dimensional distribution function 

the conditional density function 

the density of the two-dimensional normal 
distribution 

rotation angle in a sample 

regression of Y on X 

regression of X on Y 

hypothesis that the regression line in the popul- 
ation is a straight line 

hypothesis that the line belongs to set R 

hypothesis that the regression lines obtained 
by the classical method and the two-point 
method do not differ significantly 

class interval of Y 

correlation ratios 

the coefficient of efficiency 

the number of classes in the frequency 
distribution of X 

the number of points with abscissa greater 
than x 

the number of classes in the frequency distri- 
bution of Y 

the slope of the orthogonal regression line 
in a population 

E(X) 

E(Y) 



Page 



69 

32 

79 

79 

138 

139 

18 
18 
17 
20 

43 
149 

25 
25 

159 
196 



171 
116 

37 
79 

115 
125 
115 

42 
43 
43 



m n 



t*ik 
/'oi 

/<20 



n 

N 



O) 

e 



PI- 
p(y 
p(x l \y j ) 

p 

Q 

r 



The meaning of the symbol 



EfX) 
E(Y) 
E(XY) 



the number of points with abscissa 

greater than y 

frequency in the contingency table 
| the number of variables 
the size of the sample 
population 
normal distribution with parameters m = 0, 

a--- 1 

general population 
sample 

the rotation angle of a population 
probability that X--=x l and Y- y t 
marginal probability 

conditional probability 

the /~ th need 
price 

two-dimensional space 
the correlation coefficient in a population 
sample 

the standard error of estimation in a population 

> > > > ' 
the standard deviation in a population 



Page 

21 
22 
22 
22 
22 
22 
22 
22 
22 
23 
23 
23 

127 
115 

12 
100 

69 

139 
100 
100 
42 
13 
14 
14 
15 
15 
56 
71 
13 
38 
103 
33 
33 
43 
43 



s 

S" 

s 



The meaning of the symbol 

the standard error of estimation in a sample 

(random variable) 
the standard error of estimation in a sample 

(realization of random variable) 
the standard error of estimation in a sample 

(realization of random variable) 
the standard deviation in a sample 

(realization of random variable) 
the standard deviation in a sample 

(random variable) 
the standard deviation in a sample 

(realization of random variable) 
the standard deviation in a sample 

(random variable) 
the sum of financial resources 
the sum of free decisions 
supply 



V/"~2 



Page 



167 
102 
102 
102 
166 
102 

166 
56 
57 
76 

176 

166 
177 

167 



t 

U 

U 

u 



>V 



1 + 



(y-y) 



time 

Y **t 

revenue 

an arbitrary constant introduced to simplify 
calculations 

the empirical frequency distribution of var- 
iable X f 



168 

67 
33 
78 

114 
146 


The meaning of the symbol 


Page 


v7/> 


the empirical frequency distribution of var- 






iable y 


146 


W 


y 


33 


W, 


the consumption of commodity Ai 


72 


w 


an arbitrary constant introduced to facilitate 






calculations 


114 


X 


production 


78 


X 


random variable 


12 


X 


regression in a population 


25 


X ( D 


X\X<x 


125 


X (2 ) 


X\X> x 


125 


x <t> 


X\Y<y 


127 


x <2> 


X\Y> y 


127 




1 




*<D 


. .1 ^ 


125 




1 




*(2) 




125 




1 








177 


X<i> 


n m 


iz. / 




1 




X<2> 




127 




m 




X 


l -z Xi 


101 


X' 


(X - w 3 ) cos B 4- (Y - w 2 ) sin & 


145 


X 


the trend line in a sample 


188 





the two-dimensional random variable (AT,, X 2 ) 


12 





(X, Y) 


12 





w-dimensional random variable (X l9 X 2 ,...,Xn) 


12 


Y 


cost 


81 


yr 


fixed cost 


82 


y 


variable cost 


82 


Y 


average cost 


82 


Y' 


marginal cost 


82 


Y' 


- (X - /nj sin 9 -\- (Y - w 2 ) cos & 


145 


y 


regression in a sample 


101 


y 


population 


25 



15* 



The meaning of the symbol 



// m 
I 



n m 

.* 

m 
1 



profit 

T"T^7 



Page 

127 
127 
125 
125 

127 
127 
125 
125 
101 
78 
177 



TABLES 

TABLE 1 
NORMAL DISTRIBUTION 

$$\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_0^t e^{-v^2/2}\, dv$$

   t     Φ(t)       t     Φ(t)       t     Φ(t) 
 0.00   0.0000    1.00   0.3413    2.00   0.4773 
 0.05   0.0199    1.05   0.3531    2.05   0.4798 
 0.10   0.0398    1.10   0.3643    2.10   0.4821 
 0.15   0.0596    1.15   0.3749    2.15   0.4842 
 0.20   0.0793    1.20   0.3849    2.20   0.4861 
 0.25   0.0987    1.25   0.3944    2.25   0.4878 
 0.30   0.1179    1.30   0.4032    2.30   0.4893 
 0.35   0.1368    1.35   0.4115    2.35   0.4906 
 0.40   0.1554    1.40   0.4192    2.40   0.4918 
 0.45   0.1736    1.45   0.4265    2.45   0.4929 
 0.50   0.1915    1.50   0.4332    2.50   0.4938 
 0.55   0.2088    1.55   0.4394    2.55   0.4946 
 0.60   0.2257    1.60   0.4452    2.60   0.4953 
 0.65   0.2422    1.65   0.4505    2.65   0.4960 
 0.70   0.2580    1.70   0.4554    2.70   0.4965 
 0.75   0.2734    1.75   0.4599    2.75   0.4970 
 0.80   0.2881    1.80   0.4641    2.80   0.4974 
 0.85   0.3023    1.85   0.4678    2.85   0.4978 
 0.90   0.3159    1.90   0.4713    2.90   0.4981 
 0.95   0.3289    1.95   0.4744    2.95   0.4984 
                                   3.00   0.4987 
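
The entries of Table 1 can be regenerated from the error function, since the area under the standard normal density between 0 and t equals one half of erf(t/√2). The sketch below is an illustration; the names `phi` and `table` are hypothetical, and individual values may differ from the printed table in the last digit.

```python
from math import erf, sqrt

def phi(t: float) -> float:
    """Area under the standard normal density between 0 and t."""
    return 0.5 * erf(t / sqrt(2))

# Values for t = 0.00, 0.05, ..., 3.00, rounded to four decimals as in Table 1.
table = {round(0.05 * k, 2): round(phi(0.05 * k), 4) for k in range(61)}
```

For example, `table[1.0]` and `table[3.0]` agree with the printed entries for t = 1.00 and t = 3.00.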
TABLE 4 
NORMAL DISTRIBUTION (Density function) 